SAS® Visual Data Mining and Machine Learning in SAS® Viya®: Interactive Machine Learning
Course Notes
SAS® Visual Data Mining and Machine Learning in SAS® Viya®: Interactive Machine Learning Course Notes was developed by Andy Ravenna, Manoj Singh, and Catherine Truxillo. Additional contributions were made by George Fernandez and Chip Wells. Instructional design, editing, and production support was provided by the Learning Design and Development team.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

SAS® Visual Data Mining and Machine Learning in SAS® Viya®: Interactive Machine Learning Course Notes

Copyright © 2020 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

Book code E71659, course code LWVDMO35/VDMO35, prepared date 10Jun2020. LWVDMO35_001
For Your Information
Table of Contents

Lesson 1 Introduction to SAS® Visual Data Mining and Machine Learning
1.1 Overview
1.2 Data Exploration
    Demonstration: Introduction to the SAS Visual Data Mining and Machine Learning Environment
    Demonstration: Exploring Data
    Practice
1.3 SAS Viya: Details
1.4 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

Lesson 2 Machine Learning Algorithms
2.1 Introduction
    Demonstration: Partitioning Data
2.2 Neural Networks
    Demonstration: Training and Exploring a Neural Network Model in SAS Visual Data Mining and Machine Learning
    Practice
2.3 Support Vector Machines
    Demonstration: Training and Exploring an SVM Model in SAS Visual Data Mining and Machine Learning
    Practice
2.4 Forests
    Demonstration: Training and Exploring a Forest Model in SAS Visual Data Mining and Machine Learning
    Practice
2.5 Gradient Boosting
    Demonstration: Training and Exploring a Gradient Boosting Model in SAS Visual Data Mining and Machine Learning
    Practice
2.6 Bayesian Networks
    Demonstration: Training and Exploring a Bayesian Network Classifier in SAS Visual Data Mining and Machine Learning
2.7 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

Lesson 3 Model Assessment and Implementation
3.1 Model Assessment
    Demonstration: Model Comparison
3.2 Scoring
    Demonstration: Creating Score Code
3.3 Integration with Model Studio
    Demonstration: Transferring a SAS Visual Analytics Model to Model Studio
3.4 Solutions
    Solutions to Activities and Questions

Lesson 4 Factorization Machines
4.1 Factorization Machines
    Demonstration: Factorization Machines in SAS Visual Analytics

Appendix A Additional Details
A.1 Additional Details

Appendix B References
B.1 References
To learn more…
For information about other courses in the curriculum, contact the SAS Education Division at 1-800-333-7660, or send e-mail to [email protected]. You can also find this information on the web at http://support.sas.com/training/ as well as in the Training Course Catalog.

For a list of SAS books (including e-books) that relate to the topics covered in these course notes, visit https://www.sas.com/sas/books.html or call 1-800-727-0025. US customers receive free shipping to US addresses.
Lesson 1 Introduction to SAS® Visual Data Mining and Machine Learning
1.1 Overview
1.2 Data Exploration
    Demonstration: Introduction to the SAS Visual Data Mining and Machine Learning Environment
    Demonstration: Exploring Data
    Practice
1.3 SAS Viya: Details
1.4 Solutions
    Solutions to Practices
    Solutions to Activities and Questions
Copyright © 2020, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
1.1 Overview

Course Description
• This course provides an introduction to SAS Visual Data Mining and Machine Learning.
• The course is hands-on and provides direct access to an environment to experience powerful machine learning techniques via an easy-to-use, drag-and-drop visual interface.
SAS Visual Data Mining and Machine Learning adds modeling functionality to the SAS Visual Analytics web client. Paired with SAS Visual Statistics, it enables users to experience powerful statistical modeling and machine learning techniques running in SAS Viya through an easy-to-use, drag-and-drop visual interface.
Course Prerequisites
• This course builds upon the concepts of interactive predictive modeling using SAS Visual Statistics in SAS Viya.
• Knowledge of SAS Visual Statistics in SAS Viya is a prerequisite to this course.
Course Objectives
• Describe the features and benefits of SAS Visual Data Mining and Machine Learning at a high level.
• Train neural network models.
• Train forest models.
• Train a support vector machine.
• Train a gradient boosting model.
• Train Bayesian networks.
• Train a factorization machine.
• Compare model performance using honest assessment.
• Export score code and score a model.
Chapter Objectives
• Discuss the SAS Viya architecture.
• Describe SAS Cloud Analytic Services (CAS).
• Discuss the benefits and functionality of SAS Visual Data Mining and Machine Learning.
• Access SAS Visual Data Mining and Machine Learning functionality within the SAS Visual Analytics environment.
• Describe and explore the data.
• Work with the interface.
SAS Viya
[Figure: SAS Viya architecture diagram. Labels include source-based engines (In Hadoop, In-Database, In-Stream); the in-memory run-time engine, Cloud Analytic Services (CAS); microservices (Data Source Mgmt, UAA, Folders, CAS Management, Query Gen, Env Mgr, Log, Model Mgmt, Audit, and so on); BI, Data Mgmt, and Analytics GUIs; platform APIs; and solutions (Customer Intelligence, Risk Management, Fraud and Security Intelligence, Business Visualization, Analytics, Data Management).]
At the heart of SAS Viya is SAS Cloud Analytic Services (CAS), an in-memory, distributed analytics engine. It uses scalable, high-performance, multi-threaded algorithms to rapidly perform analytical processing on in-memory data of any size.

SAS Viya contains microservices. A microservice is a small service that runs in its own process and communicates with a lightweight mechanism (HTTP). Microservices are a series of containers that define all the different analytic life cycle functions (sometimes described as actions) that fit together in a modular way. The in-memory engine is independent from the microservices and allows for independent scalability. On the left, you see a series of source-based data engines.

SAS Viya has a middle tier implemented on a microservices architecture, deployed and orchestrated through the industry standard cloud Platform as a Service, Cloud Foundry. Through Cloud Foundry, SAS Viya can be deployed, managed, monitored, scaled, and updated. Cloud Foundry enables SAS Viya to support multiple cloud infrastructure, enabling customers to deploy SAS in a hybrid cloud environment that spans multiple clouds, including the combination of on-premises cloud infrastructure and public cloud infrastructure. You can choose to use other platforms such as Docker and the open container initiative. You can operate on private infrastructure such as OpenStack and VMware or open infrastructure such as Amazon Web Services and Azure.

Existing SAS solutions and new ones are being built in SAS Viya. In addition, you can use REST APIs to include SAS Viya actions into your existing applications.
SAS Cloud Analytic Services
Cloud Analytic Services (CAS) is an in-memory, distributed analytics engine. It uses scalable, high-performance, multi-threaded algorithms to rapidly perform analytical processing on in-memory data of any size.
[Figure: Run-Time Environments diagram. Labels include SAS Cloud Analytic Services, CAS Controller, CAS Worker, CAS Data Connector, and Application Services (Middle Tier).]
SAS Cloud Analytic Services (CAS) is a server that provides the cloud-based run-time environment for data management and analytics with SAS. Run-time environment refers to the combination of hardware and software where data management and analytics occur.

CAS is designed to run in a single-machine symmetric multiprocessing (SMP) or multi-machine massively parallel processing (MPP) configuration. The distributed server consists of one controller and one or more workers. For both modes, the server is multi-threaded for high-performance analytics.

The distributed server has a communication layer that supports fault tolerance. A distributed server can continue processing requests even after losing connectivity to some nodes. The communication layer also enables you to remove or add nodes from a server while it is running.

One of the design principles of the server is to handle large problems and to work with tables that exceed the memory capacity of the environment. To address this principle, data in the server are managed in blocks. Whenever needed, the server caches the blocks on disk. It is this feature that enables the server to manage memory efficiently, handle large data volumes, and remain responsive to requests.
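The controller/worker arrangement and block-wise data management described above can be illustrated with a small, generic Python sketch. This is not CAS code and does not reflect how CAS is implemented. The function names (split_into_blocks, worker_partial, controller_mean), the block size, and the toy column are all assumptions made for illustration. The idea shown is only that a table can be summarized one block at a time, with each worker returning a small partial result that a controller combines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Tuple


def split_into_blocks(rows: List[float], block_size: int) -> Iterator[List[float]]:
    """Yield consecutive blocks of at most block_size rows."""
    for start in range(0, len(rows), block_size):
        yield rows[start:start + block_size]


def worker_partial(block: List[float]) -> Tuple[float, int]:
    """A hypothetical 'worker' summarizes one block as (sum, row count)."""
    return sum(block), len(block)


def controller_mean(rows: List[float], block_size: int, n_workers: int) -> float:
    """A hypothetical 'controller' hands blocks to workers and combines the partials."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_partial, split_into_blocks(rows, block_size)))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Toy column with the values 1.0 through 100.0 (illustrative data only).
sales = [float(v) for v in range(1, 101)]
block_mean = controller_mean(sales, block_size=16, n_workers=4)
```

Because each worker returns only a sum and a count, the combining step never needs the full table in one place, which is the property that makes block-wise processing attractive for tables larger than memory.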
SAS Visual Analytics Applications
[Figure: Applications diagram. Labels include SAS Drive, SAS Report Viewer, SAS Visual Analytics, SAS Visual Statistics, SAS Visual Analytics App, SAS Cloud Analytic Services (CAS), SAS Theme Designer, SAS Graph Builder, SAS Data Studio, and SAS Visual Data Mining and Machine Learning.]
SAS Drive | Hub for the SAS Viya applications that enables you to easily view, organize, and share your content from one place
SAS Report Viewer | View reports in a browser
SAS Visual Analytics App | View reports on a tablet or mobile device
SAS Theme Designer | Create custom themes for the application or reports
SAS Graph Builder | Create customized graph objects
SAS Cloud Analytic Services (CAS) | Cloud-based, run-time environment server for data management and analytics
SAS Data Studio | Prepare data using transforms. Note: If SAS Data Preparation is licensed, data quality transforms are available.
Visual Analytics | Visualize data interactively, create interactive reports, build statistical models
Add-on Products | SAS Visual Statistics, SAS Visual Data Mining and Machine Learning
Model Studio might also be included in your list of applications. Model Studio is an integrated, visual environment that provides a suite of analytic tools to facilitate end-to-end data mining, text analytics, and forecasting analyses. The tools that are supported in Model Studio are designed to take advantage of the SAS Viya programming and cloud processing environments. They help deliver and distribute the results of analyses such as champion models, score code, and results.

Model Studio is a common interface that contains the following SAS solutions:
• SAS Visual Forecasting
• SAS Visual Data Mining and Machine Learning in Model Studio
• SAS Visual Text Analytics

The availability of the functionality in Model Studio depends on your SAS license and the permissions that are assigned to you by your administrator.
SAS Visual Data Mining and Machine Learning
[Figure: Two ways to access SAS Visual Data Mining and Machine Learning. Programming approach: SAS Studio, SAS procedures, CAS actions. Visual drag-and-drop approach: SAS Visual Analytics, Model Studio.]
SAS Visual Data Mining and Machine Learning capabilities can be accessed through a visual or a programming interface to build models, depending on users' preferences and skills. The visual interface (SAS Visual Analytics) enables you to interactively explore data and build machine learning models, and another visual interface (Model Studio) provides an integrated environment for the most common machine learning steps: data preparation, feature engineering, data exploration, model building, and deployment.

SAS Studio is an interactive programming interface for those who want to code in SAS or in other languages while taking advantage of powerful SAS statistical modeling and machine learning techniques. You can also use the predefined tasks in SAS Studio to generate SAS code. However, the availability of these tasks depends on what you license and install at your site. Users can programmatically access analytical actions from SAS Studio, call them from other languages (Python, R, Lua, Java), or use public REST APIs to add SAS Analytics to existing applications.
What Is SAS Visual Data Mining and Machine Learning?
SAS Visual Data Mining and Machine Learning is an add-on to SAS Visual Analytics and SAS Visual Statistics that enables you to develop and test models using the in-memory capabilities of SAS servers. SAS Visual Data Mining and Machine Learning models
• take advantage of SAS Cloud Analytic Services (CAS) to persist and analyze data in memory
• enable concurrent access to in-memory data for multiple users to formulate and refine models.
SAS Visual Data Mining and Machine Learning
Here are some key features of SAS Visual Data Mining and Machine Learning:
• provides a single, integrated in-memory environment
• performs data exploration, feature engineering, and dimension reduction
• applies modern statistical and machine learning techniques to data of any size
• automates the process of determining optimum model parameters
• performs model assessment and puts the selected model into production using existing SAS products
SAS Drive
SAS Drive is a hub for the SAS Viya applications, and it enables you to easily view, organize, and share your content from one place. SAS Drive uses the standard sign-in window for SAS applications. To display a sign-in window, enter the URL provided by your administrator (for example, https://prod.host.com/SASDrive). To open SAS Drive from another SAS Viya application, select Share and Collaborate from the side menu in the upper left.

1. The application bar enables you to access other SAS applications, view your notifications, update your settings, access help, and sign out of SAS Drive.
2. The toolbar enables you to create new content, search, undo and redo your changes, and access the SAS Drive main menu.
3. The Quick Access area provides convenient access to your most-used items.
4. The tabs bar provides different views of your content.
5. The canvas displays the contents of the tab, folder, or search that is currently selected.
6. The information pane displays details and comments for the currently selected item.

The displayed tabs depend on the products that are installed at your site.

Note: My Folder is a shortcut to /SAS Content/Users/[userID]/MyFolder/.
Accessing SAS Visual Data Mining and Machine Learning Functionality
Select Explore and Visualize from the Applications menu. You might see additional applications depending on permissions status.
Users need an account to sign in to SAS Drive. Access to Visual Analytics and the Visual Data Mining and Machine Learning add-on functionality is determined by the permissions that are assigned to the account with which you sign in. SAS accounts and their respective permissions are administered using SAS Environment Manager. To sign out of SAS Visual Analytics, click (student, used while signing in) in the upper right corner and then click Sign out. You are prompted to save your report if changes were made since it was last saved. Be sure to save your work at regular intervals. By default, you are automatically signed out after 30 minutes of inactivity. SAS Visual Analytics saves one draft of each report per user. There is a feature for automatically recovering reports, which are saved every five seconds. (To save a report, click (Menu). Then select Save or Save As.)
SAS Visual Analytics Window
Use the SAS Visual Analytics Explore and Visualize window to do one of the following:
• access the Choose Data window to open or import data
• create a new report
• open an existing report
When you access SAS Visual Data Mining and Machine Learning functionality through SAS Visual Analytics from SAS Drive, the Explore and Visualize window appears, unless a default is selected. In the window, you begin by working with either data or reports. Click New Report to start a new report. Click My Folder or All Reports to access a report that already exists. Click Start with Data to access the Choose Data window to import new data sources or load existing data sources into a report.

The Choose Data window displays three tabs: the Available tab, the Data Sources tab, and the Import tab.

The Available tab displays all tables and files that have been loaded to memory from any CAS server to which you have access.

The Data Sources tab enables you to create a connection to external data sources. This connection is a caslib. If the connection is successful, tables and files that you are authorized to access are available on the Data Sources tab. It can also be used to work with caslibs and tables on a CAS server.

The Import tab enables you to copy data that is accessed with a caslib. Use the following options on the Import tab to copy different types of data:

Directory | Use the Directory option to copy content from multiple tables or files and load it to an in-memory table. You can import delimited files, images, documents, and ORC (Optimized Row Columnar, an open-source data storage format) files.
Folders | You can use the Folders option to import files that were saved to a SAS folder.
Local files | You can import data from a Microsoft Excel spreadsheet (XLS or XLSX), a text file (CSV or TXT), or a SAS data set (SASHDAT or SAS7BDAT).
Social media | After authenticating with Facebook, Google Analytics, Twitter, or YouTube and providing search criteria, you can import data to the CAS server. Note: Your access to, and use of, social media data through a social media provider's public APIs is subject to the social media provider's applicable license terms, terms of use, and other usage terms and policies.
Note: Environmental Systems Research Institute (Esri) data can also be loaded to the CAS server if your login has appropriate privileges.
SAS Visual Analytics Interface
[Figure: SAS Visual Analytics report page with callouts for page tabs, menu, Data pane, object options, object selection pane, object variable roles, and canvas with object.]
The left pane on the SAS Visual Analytics report page contains the following icons:

Data | The Data pane enables you to work with data sources, create new data items (hierarchy, calculated item, custom category), add a data source filter, and view and modify properties for data items.
Objects | The Objects pane provides a list of tables, graphs, gauges, controls, containers, and many modeling and machine learning objects that can be included in the report.
Suggest | The Suggestions pane provides you with suggested objects after selecting a data source. This feature generates objects to help you consider new options for your data.
Outline | The Outline pane enables you to view and work with pages and objects in your report.

The right pane contains the following icons:

Options | The Options pane lists the options and styles that are available for the currently selected report, page, or report object.
Roles | The Roles pane enables you to add or modify role assignments for the currently selected report object.
Actions | The Actions pane enables you to create links, filter actions, and link selection actions between objects.
Rules | The Rules pane enables you to view, add, or modify rules as to how the currently selected object is displayed (for example, expression, color-mapped values, and gauge).
Filters | The Filters pane enables you to view, add, or modify filters for the selected report object.
Ranks | The Ranks pane enables you to view, add, or modify rankings for the selected report object.
SAS Visual Data Mining and Machine Learning Objects
To access the SAS Visual Data Mining and Machine Learning objects list, select the Objects pane in SAS Visual Analytics and scroll down. If SAS Visual Data Mining and Machine Learning is licensed at your site (and you have permission), the list of objects can be accessed under SAS Visual Data Mining and Machine Learning.
1.2 Data Exploration

Objectives
• Load data into CAS.
• Explore data.
Idea Exchange
• How many rows does a typical data set (table) that you work with contain?
• How many variables (fields)?
Course Data Introduction
Anonymized and transformed campaign data from a large financial services firm's accounts
• includes home equity lines of credit, loans, and other short-term to medium-term credit instruments from more than a half-year time interval
• focuses on direct and indirect promotions
• has three target variables (B_TGT is the focus.)
Data Description
The data set used in this course is a large data set, with more than 1 million rows (or observations) and 24 columns (or fields). Three target variables are provided in the course data, but the primary focus is on the binary target (B_TGT, see below).

The data set VS_Bank_Part consists of observations that were taken from a large financial services firm's accounts. Accounts in the data represent consumers of home equity lines of credit, automobile loans, and other types of short-term to medium-term credit instruments. The data were anonymized and transformed to conform to the description that follows.

A campaign interval duration for the bank is half of a year. Campaign is used here to denote all marketing efforts that provide information about and motivate the contracting (purchase) of the bank's financial services products. Campaign promotions are categorized into direct and indirect. Direct promotions consist of sales offers that involve an incentive to a particular account. Indirect promotions are marketing efforts that do not involve an incentive.

In addition to the account identifier (Account ID), the variables below are in the data set.

Target variables quantify account responses during the current campaign season.

Name | Label | Description
B_TGT | Tgt Binary New Product | A binary target variable. Accounts coded with a 1 contracted for at least one product in the previous campaign season. Accounts coded with a zero did not contract for a product in the previous campaign season.
INT_TGT | Tgt Interval New Sales | The amount of the financial services products (sum of sales) per account in the previous campaign season, denominated in US dollars.
CNT_TGT | Tgt Count Number New Products | The number of the financial services products (count) per account in the previous campaign season.
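The relationship among the three targets can be sketched in a few lines of Python. The descriptions above suggest that an account is coded 1 on B_TGT when it contracted for at least one product, so the sketch assumes that B_TGT equals 1 exactly when CNT_TGT is at least 1. That rule is an inference from the descriptions, not something the course data documentation states, and the account IDs and values below are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AccountResponse:
    account_id: str   # hypothetical identifier, for illustration only
    cnt_tgt: int      # CNT_TGT: number of products contracted
    int_tgt: float    # INT_TGT: sum of sales in US dollars


def binary_target(cnt_tgt: int) -> int:
    """Assumed rule: 1 if at least one product was contracted, else 0."""
    return 1 if cnt_tgt >= 1 else 0


# Made-up accounts; these are not rows from VS_Bank_Part.
accounts: List[AccountResponse] = [
    AccountResponse("A001", 0, 0.0),
    AccountResponse("A002", 2, 15000.0),
    AccountResponse("A003", 1, 4200.0),
]

b_tgt = [binary_target(a.cnt_tgt) for a in accounts]
response_rate = sum(b_tgt) / len(b_tgt)   # share of accounts coded 1
```

The mean of a 0/1 target is the response rate, which is a useful first number to look at before modeling a binary target.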
Categorical-valued inputs summarize account-level attributes that are related to the propensity to buy products and other characteristics that are related to profitability and creditworthiness. These variables were transformed to anonymize account-level information and to mitigate quality issues that are related to excessive cardinality.

Name | Label | Description
CAT_INPUT1 | Category 1 Account Activity Level | A three-level categorical variable that codes the activity of each account. X → high activity (the account enters the current campaign period with many products). Y → average activity. Z → low activity.
CAT_INPUT2 | Category 2 Customer Value Level | A five-level (A-E) categorical variable that codes customer value. For example, the most profitable and creditworthy customers are coded with A.
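The course notes do not say how the categorical inputs were transformed to limit cardinality. One common approach, shown here only as a generic sketch and not as the method used to prepare this data, is to collapse rarely occurring levels into a single catch-all level. The function name collapse_rare_levels, the OTHER label, the threshold, and the raw codes are all hypothetical.

```python
from collections import Counter
from typing import List


def collapse_rare_levels(values: List[str], min_count: int, other: str = "OTHER") -> List[str]:
    """Replace every level that occurs fewer than min_count times with one catch-all level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Made-up raw codes with five distinct levels, three of which occur only once.
raw = ["P1", "P1", "P2", "P1", "P3", "P2", "P4", "P1", "P2", "P5"]
collapsed = collapse_rare_levels(raw, min_count=2)

levels_before = len(set(raw))
levels_after = len(set(collapsed))
```

Fewer levels means fewer parameters for models that expand each level into its own indicator, at the cost of losing the distinction among the rare levels.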
Interval-valued inputs provide continuous measures on account-level attributes that are related to the recency, frequency, and sales amounts (RFM). These variables were transformed to anonymize account-level information. All measures below correspond to activity that was prior to the current campaign season.

Name | Label | Description
RFM1 | RFM1 Average Sales Past 3 Years | Average sales amount attributed to each account during the past three years
RFM2 | RFM2 Average Sales Lifetime | Average sales amount attributed to each account during the account's tenure
RFM3 | RFM3 Avg Sales Past 3 Years Dir Promo Resp | Average sales amount attributed to each account in the past three years in response to a direct promotion
RFM4 | RFM4 Last Product Purchase Amount | Amount of the last product purchased
RFM5 | RFM5 Count Purchased Past 3 Years | Number of products purchased in the past three years
RFM6 | RFM6 Count Purchased Lifetime | Total number of products purchased in each account's tenure
RFM7 | RFM7 Count Prchsd Past 3 Years Dir Promo Resp | Number of products purchased in the previous three years in response to a direct promotion
RFM8 | RFM8 Count Prchsd Lifetime Dir Promo Resp | Total number of products purchased in the account's tenure in response to a direct promotion
RFM9 | RFM9 Months Since Last Purchase | Months since the last product purchase
RFM10 | RFM10 Count Total Promos Past Year | Number of total promotions received by each account in the past year
RFM11 | RFM11 Count Direct Promos Past Year | Number of direct promotions received by each account in the past year
RFM12 | RFM12 Customer Tenure | Customer tenure in months
Demographic variables describe the profile of each account in terms of income, homeownership, and other characteristics.
Name | Label | Description
DEMOG_AGE | Demog Customer Age | Average age in each account’s demographic region
DEMOG_GENF | Demog Female Binary | A categorical variable that is 1 if the primary holder of the account is female and 0 otherwise
DEMOG_GENM | Demog Male Binary | A categorical variable that is 1 if the primary holder of the account is male and 0 otherwise
DEMOG_HO | Demog Homeowner Binary | A categorical variable that is 1 if the primary holder of the account is a homeowner and 0 otherwise
DEMOG_HOMEVAL | Demog Home Value | Average home value in each account’s demographic region
DEMOG_INC | Demog Income | Average income in each account’s demographic region
DEMOG_PR | Demog Percentage Retired | The percentage of retired people in each account’s demographic region
Partition Variable
Name | Label | Description
Partition_Indicator | Partition Indicator | A binary variable that indicates training and validation observations
Introduction to the SAS Visual Data Mining and Machine Learning Environment
In this demonstration, you access the SAS Visual Data Mining and Machine Learning interface and load the data.
1. From the Windows desktop, launch your web browser. When the browser opens, select SAS Viya > SAS Drive from the bookmarks bar or from the link on the page.
2. Enter student in the User ID field.
3. Enter Metadata0 in the Password field.
Note: Depending on the computing environment that you are using, the credentials might be different. Use caution when you enter the user ID and password because values can be case sensitive.
4. Click Sign In.
5. Select Yes in the Assumable Groups window. The SAS Drive home page appears.
Note: The SAS Drive page on your classroom computer might not have the same tiles as the image above.
6. Click (Show list of applications) in the upper left corner of the SAS Drive page. Select Explore and Visualize.
The SAS Visual Analytics-Explore and Visualize window is displayed.
7. Import SAS data sets and load them to memory on a CAS server.
a. Click Start with Data.
b. In the Choose Data window, click Import.
c. Under Import, select Local files > Local file.
d. Navigate to D:\Workshop\VDMO.
Note: Depending on your compute environment, the data might be in a different location or already loaded in memory.
e. Select all three tables: vs_bank_part.sas7bdat, pva_partition.sas7bdat, and moviefm.sas7bdat. Click Open. The Choose Data window should appear.
f. Do not select the default Import Item. Instead, select Import All. It might take a few seconds for the tables to load into memory. There will be green check marks next to each table name when the load is completed.
g. Click OK. The tables are imported to the CAS server and are available to use with Visual Analytics. When the import is complete, the Data pane is displayed. It lists the data items from the vs_bank_part table. If vs_bank_part is not the active table, click moviefm or pva_partition and change tables. Note: The moviefm table is used for demonstrations in Factorization Machines.
The Data pane enables you to work with data sources, create new data items (hierarchy, calculated item, custom category, interaction effect, spline effect, partition), add a data source filter, and view and modify properties for data items.
In the Data pane, distinct counts are displayed for each category data item.
• Character and datetime data items are treated as categories. These are data items whose distinct values can be used to group and aggregate measures.
• Numeric data items are treated as measures. These are data items whose values can be used in computations.
Note: To load another table, select the menu next to the table name.
8. Save your report. Click (Menu) and select Save As. Save the report in My Folder > My Tasks with the name Model Fitting. Click Save.
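The default treatment of data items in the Data pane (character items as categories with distinct counts, numeric items as measures) can be mimicked in a few lines of Python. This is only a sketch under assumptions: the rows are made up, only the column names are borrowed from the data dictionary, and the classification rule is a simplification rather than the actual logic of SAS Visual Analytics.

```python
# Hypothetical rows; only the column names come from the data dictionary.
rows = [
    {"cat_input1": "X", "cat_input2": "A", "rfm5": 3, "demog_inc": 52000.0},
    {"cat_input1": "Y", "cat_input2": "C", "rfm5": 1, "demog_inc": 41000.0},
    {"cat_input1": "X", "cat_input2": "A", "rfm5": 0, "demog_inc": 63500.0},
    {"cat_input1": "Z", "cat_input2": "E", "rfm5": 2, "demog_inc": 38000.0},
]


def classify(rows):
    """Label each column 'category' or 'measure' from the type of its values."""
    result = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, str) for v in values):
            # Categories carry a distinct count, as shown in the Data pane.
            result[name] = ("category", len(set(values)))
        else:
            # Numeric items are measures; no distinct count is reported.
            result[name] = ("measure", None)
    return result


classification = classify(rows)
```

Under this simplified rule, a numeric flag such as the binary target would also be labeled a measure, which is why the course changes its classification by hand later on.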
Know Thy Data: Data Exploration
One of the most critical and potentially time-consuming parts of an analytical project is exploring the data. “Get to know your data.”
• Use graphical and numerical methods.
• Consider: outliers, minimums, maximums, percent missing, means, ranges, standard deviations, distributions (shapes from plots), and so on.
Know thy data. Is it clean? What are the key characteristics? What about missing data, outliers, and so on? Failure to understand these aspects of your data will result in a flawed report, forecast, or model. Once the proper data are obtained, exploring the data can be one of the most time-consuming parts of an analytical project. When exploring data, analysts try to gain intimate knowledge of the variables that the data contain.
Both graphical and numerical methods are generally used to gain familiarity with the data. Common graphical tools include histograms, scatter plots, bar charts, and stem-and-leaf plots. There are also more modern graphical tools, such as heat maps and word clouds, which scale well to large data sets.
Numerical summary methods are also used to explore data. These include measures of central tendency such as the mean, median, or mode, and measures of variability such as the variance, standard deviation, range, or mid-range. Extreme values (outliers, the minimum, and the maximum) are examined, as are counts or percentages of missing data. Bivariate measures such as correlation are also used.
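As a rough illustration of the numerical summaries mentioned above, the following Python sketch computes several of them for two small, made-up columns. The values, the variable names, and the use of None to represent a missing entry are assumptions for illustration only; they are not taken from the course data.

```python
import math
import statistics

# Hypothetical columns; None marks a missing value.
age = [34, 51, None, 47, 62, None, 58, 45]
sales = [12.0, 20.5, 8.0, 15.0, 30.0, 9.5, 22.0, 14.0]


def summarize(values):
    """Return basic descriptive statistics, ignoring missing entries."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "std": statistics.stdev(present),
    }


def pearson(x, y):
    """Pearson correlation over the pairs in which both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    mx = statistics.mean([a for a, _ in pairs])
    my = statistics.mean([b for _, b in pairs])
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    sx = math.sqrt(sum((a - mx) ** 2 for a, _ in pairs))
    sy = math.sqrt(sum((b - my) ** 2 for _, b in pairs))
    return cov / (sx * sy)


age_summary = summarize(age)
age_sales_corr = pearson(age, sales)
```

A summary of this kind is what makes problems such as a high percentage of missing values or an implausible minimum visible before any modeling starts.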
Exploring Data
In this demonstration, you perform initial data exploration using Visual Analytics capabilities.
1. Begin the data exploration by examining the descriptive statistics of the numeric data items. Open the Model Fitting report that you saved in the previous demonstration.
2. In the Data pane, click (Actions) and select View measure details.
The resulting Measure Details window displays basic statistics (minimum, maximum, average, sum) for all the measures.
3. Select a measure in the above table to get additional statistics for that selected measure along with the plot of its distribution.
The variable tgt Binary New Product (b_tgt) is the primary dependent variable for categorical response modeling in this course. It is a binary flag that codes responders with 1 and nonresponders with 0. Because it is numeric, it is treated as interval valued by default.
A closer examination of the Measure Details table indicates that several observations have missing entries for some variables (demog Customer Age, RFM3). We impute the mean for some of these variables in the next few steps. We can also observe that both RFM1 (Average Sales Past Three Years) and RFM4 (Last Product Purchase Amount) have negative values. Customers cannot have negative past sales or purchase amounts, so we transform these values as well.
4. Select Close to exit the Measure Details window.
5. The target variable of interest (tgt Binary New Product) needs to be changed to a categorical variable for subsequent modeling demonstrations. Select Edit properties for tgt Binary New Product and change its classification to Category.
6. Drag the tgt Binary New Product variable to the canvas to create the auto chart. A bar chart is created by default.
The bar chart shows that approximately 200,000 customers have tgt Binary New Product=1. That is, of the total of slightly more than 1 million customers, approximately 20% made a purchase.
Note: You can change the frequency counts to percentages in the bar chart by replacing the Frequency data item under Measure with Frequency Percent.
Now we deal with variables that have missing observations. While examining the Measure Details table, you noticed that several variables had missing observations. Let us focus on the demog Customer Age variable, which has around 266,861 missing observations.
The issue of missing values in data is (nearly) always present and always a concern. One of the biggest concerns with missing data is that observations with missing data are often omitted from an analysis, which can bias the results. The process where algorithms use only cases with no missing data is called complete case analysis. Many modeling algorithms in SAS Visual Data Mining and Machine Learning operate under complete case analysis (for example, linear and logistic regression, neural networks, and support vector machines).
The typical approach to solving the missing data issue is imputation. Imputation is the act of plugging in a “reasonable” value for a missing observation. This reasonable guess might be based on a summary statistic such as the mean or median, or it might come from a more advanced method in which a model is built using other inputs to predict the missing quantity. The next few steps explain how Visual Analytics capabilities can be used to plug in the mean value for a missing observation in the demog Customer Age variable.
7. Impute missing observations in demog Customer Age.
a. On the Data pane, select New data item > Calculated item.
b. In the Name field, enter Imp_demog_Customer_Age.
c. For the Result Type field, verify that Automatic (Numeric) is selected.
d. For the Format field, click (Edit).
1) In the Format window, select Numeric.
2) For the Width field, verify that 12 is specified.
3) Click OK.
e. On the left side of the window, click Operators.
f. Expand the Boolean group.
g. Double-click the IF…ELSE operator to add it to the expression.
h. Expand the Comparison group.
i. Drag x=y to the condition field in the expression. (Alternatively, you can right-click the condition box and select Use Inside > x=y.)
j. On the left side of the window, click Data Items.
k. Expand the Numeric group.
l. Drag demog Customer Age to the number field on the left of the equal sign.
m. Enter . (missing) in the number field on the right of the equal sign.
n. Enter 58.72 in the number field for the RETURN operator.
Note: The average for demog Customer Age is 58.72 as listed in the Measure Details table above.
o. Drag demog Customer Age to the number field for the ELSE operator. The expression should resemble the following:
p. In the lower right corner of the window, click OK. The new data item is added to the Data pane.
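The logic of step 7 can also be written out in code. The following Python sketch contrasts complete case analysis with mean imputation on a tiny, made-up list of ages. The values and variable names are hypothetical, and None stands in for the SAS missing value; the sketch mirrors the IF…ELSE expression but is not the code that SAS Visual Analytics generates.

```python
import statistics

# Hypothetical ages; None marks a missing observation.
ages = [61.0, None, 55.0, 70.0, None, 48.0]

# Complete case analysis: rows with a missing value are simply dropped.
complete_cases = [a for a in ages if a is not None]

# Mean imputation: IF age is missing RETURN the mean ELSE age.
mean_age = statistics.mean(complete_cases)
imputed = [mean_age if a is None else a for a in ages]
```

In this sketch all six rows are retained after imputation, and the mean of the imputed column equals the mean of the complete cases, whereas complete case analysis would keep only four rows.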
Similarly, other variables with missing observations can be imputed with any reasonable value supplied by the analyst. For brevity, missing value imputation of other variables is not discussed.
Recall that RFM1 (Average Sales Past Three Years) and RFM4 (Last Product Purchase Amount) have negative values. It is unlikely that the past sales and product purchase values would be negative. One approach for addressing such variables is to first recode negative values to missing and then replace these missing values with the mean or another measure of central tendency. The steps that follow show how this can be achieved using Visual Analytics functionality for the RFM1 variable.
8. Recode negative values to missing.
a. Again, on the Data pane, select New data item > Calculated item.
b. In the Name field, enter Recoded_rfm1.
c. For the Result Type field, verify that Automatic (Numeric) is selected.
d. For the Format field, click (Edit).
1) In the Format window, select Numeric.
2) For the Width field, verify that 12 is specified.
3) Click OK.
e. On the left side of the window, click Operators.
f. Expand the Boolean group.
g. Double-click the IF…ELSE operator to add it to the expression.
h. Expand the Comparison group.
i. Drag x < y to the condition field in the expression. (Alternatively, you can right-click the condition box and select Use Inside > x < y.)
j. On the left side of the window, click Data Items.
k. Expand the Numeric group.
l. Drag rfm1 Average Sales Past 3 years to the number field on the left of the inequality sign.
m. Enter 0 in the number field on the right of the inequality sign.
n. Enter . (missing) in the number field for the RETURN operator.
o. Drag rfm1 Average Sales Past 3 years to the number field for the ELSE operator. The expression should resemble the following:
p. In the lower right corner of the window, click OK. The new data item Recoded_rfm1 is added to the Data pane.
Now the missing values created in the Recoded_rfm1 variable are imputed using the mean of this variable. (You can again use View measure details to see the descriptive statistics of all the variables, including this newly created data item.) The mean of the Recoded_rfm1 variable is found to be 16.09, and this value is used for imputation.
9. Impute missing observations for the Recoded_rfm1 variable.
a. Repeat steps 7.a through 7.o to calculate a new data item, Imputed_rfm1, by replacing missing observations of the Recoded_rfm1 variable with its mean, 16.09. The final expression window should resemble the following:
b. In the lower right corner of the window, click OK. The new data item, Imputed_rfm1, is added to the Data pane.
10. In the upper left corner of the report, click (New page) next to Page 1.
11. In the left pane, click the Data icon. Drag Imputed_rfm1 to the canvas. The resulting distribution is skewed to the right. For some models, it is useful to use a regularizing transformation for input variables. The log transformation regularizes this distribution.
12. Create a transformed variable.
a. On the Data pane, select New data item > Calculated item.
b. In the Name field, enter Transformed_rfm1.
c. For the Result Type field, verify that Automatic (Numeric) is selected.
d. For the Format field, click (Edit).
1) In the Format window, select Numeric.
2) For the Width field, verify that 12 is specified.
3) Click OK.
e. On the left side of the window, click Operators.
f. Expand the Numeric (advanced) group.
g. Double-click the Log operator to add it to the expression.
h. On the left side of the window, click Data Items.
i. Expand the Numeric group.
j. Drag Imputed_rfm1 to the number field to the left of Log.
k. Enter 10 in the second number box. To avoid calculating the log of any zero values in the Imputed_rfm1 column, add 1 in the formula.
l. Click the Text tab. The expression should resemble the following:
m. Enter +1 after 'Imputed_rfm1'n and insert parentheses such that the expression resembles the following:
Alternatively, on the Visual tab, you can drag the x+y operator to the variable name to add 1 to it.
n. Click Preview in the lower right corner of the window to preview the result and determine whether the formula has taken effect.
o. Close the preview result window.
p. In the lower right corner of the window, click OK. The new data item is added to the Data pane.
13. Click Page 2 and drag Transformed_rfm1 to the right of the existing histogram on the canvas to create a new auto chart.
14. Compare the histograms.
The inputs for modeling are the RFM variables. These variables had missing values and are heavily skewed. Therefore, transformed and imputed versions of each are in the data set, with the prefixes logi_ and i_. These were already added to the data for convenience.
15. Save your report. Click (Menu) and select Save.
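The three calculated items built in steps 8, 9, and 12 (recode negatives to missing, impute the mean, then take the base-10 log of the value plus 1) can be summarized in a short Python sketch. The input values are made up, None stands in for the SAS missing value, and the list names only echo the data item names used in the demonstration.

```python
import math
import statistics

# Hypothetical RFM1-like values; the negative entry is assumed to be invalid.
rfm1 = [10.0, -5.0, 0.0, 99.0, 20.0]

# Step 8: IF rfm1 < 0 RETURN missing ELSE rfm1.
recoded_rfm1 = [None if v < 0 else v for v in rfm1]

# Step 9: IF recoded value is missing RETURN the mean ELSE the recoded value.
mean_recoded = statistics.mean([v for v in recoded_rfm1 if v is not None])
imputed_rfm1 = [mean_recoded if v is None else v for v in recoded_rfm1]

# Step 12: base-10 log of (value + 1), so that a zero maps to 0
# instead of raising an error.
transformed_rfm1 = [math.log10(v + 1) for v in imputed_rfm1]
```

Adding 1 before taking the log is what keeps the zero entry usable; the transformation compresses the large value (99) far more than the small ones, which is how it reduces right skew.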
1.01 Multiple Choice Question
A category variable can be created from a measure variable by editing which of the following property types?
a. aggregation
b. format
c. classification
d. This cannot be done.
Practice
You work with the PVA_PARTITION data set for practices. It contains data that represent charitable donations made to a veterans’ organization. The data represent the results of a mail campaign to solicit donations. Solicitations involve sending a small gift to an individual and include a request for a donation. The data set contains the following information:
• a flag to indicate respondents to the appeal (Target Gift Flag) and the dollar amount of their donations (Target Gift Amount)
• respondents’ PVA promotion and giving history
• demographic data of the respondents
In the first practice, you use SAS Visual Analytics to familiarize yourself with the data.
1. Using SAS Visual Analytics
a. Sign in to SAS Visual Analytics. Enter student in the User ID field and Metadata0 in the Password field.
b. Select Explore and Visualize to begin accessing and exploring the data.
c. Select the PVA_PARTITION data source.
d. Select the Data pane on the left of the canvas (if it is not open).
1) Which level of the Status Category 96NK variable has the highest count? _____________
2) Does the variable Age contain any missing values? If so, how many? ____________________________
3) What is the average of Target Gift Amount? _________________________________
e. Change Target Gift Flag from a measure to a category. It is a binary indicator that represents a response to a mailing, where 1 indicates that customers did respond.
1) How are responders and non-responders distributed in the data? __________________
2) How many females responded to the campaign? ________________
f. Save the report. Click (Menu) and select Save As. Save the report in My Folder > My Tasks with the name Exercise 1. Click Save.
1.3 SAS Viya: Details
Objectives
• Discuss SAS Viya functionality.
• Discuss SAS Viya interfaces.
SAS Viya Functionality
Among the core products on SAS Viya, available functionality is cumulative.
• SAS Visual Analytics provides baseline functionality, including reporting and basic analytics.
• SAS Visual Statistics provides an additional set of advanced analytic functions.
• SAS Visual Data Mining and Machine Learning provides a second additional set of advanced analytic functions.
SAS Visual Analytics provides easy-to-use data discovery, analysis, and visualization capabilities. Novices can explore and analyze business-critical data without assistance and then distribute dynamic dashboards to others. The interface provides interactive graphics as well as auto-charting capabilities that help new users choose the best visualization for their data. It is easy to generate analytical visualizations, interactively filter data, format variables, add ad hoc calculated columns, and create dynamic hierarchies without the need for predefined dimensional data structures.
SAS Visual Data Mining and Machine Learning is an interactive modeling tool that can be added to SAS Visual Analytics and SAS Visual Statistics for increased predictive capabilities. The modeling capabilities of SAS Visual Data Mining and Machine Learning are included within the common user interface of SAS Visual Analytics Explorer. This means that users can explore data and create models from the same easy-to-use environment. By combining these products, you get an in-memory platform that provides self-service data exploration, visualization, and powerful descriptive and predictive analytics along with machine learning algorithms.
SAS Viya Functionality
SAS Visual Analytics
• Explore data and discover relationships
• Examine distributions and summary statistics
• Perform post-model analysis and reporting
SAS Visual Statistics
• Build unsupervised and supervised models
• Interactively refine candidate models
• Compare models and generate score code
SAS Visual Data Mining and Machine Learning
• Six additional machine learning models
SAS Viya Interfaces
Multiple Interfaces, Including Visual and Programmatic
Users can experience powerful statistical modeling and machine learning techniques via a variety of interfaces.
• This course focuses on the user-friendly, web-based HTML5 reporting interface that targets business users and report builders.
• SAS Studio offers tasks that enable you to point and click in an interactive environment that builds code in the background.
• SAS Studio provides access to a SAS programming environment and a web-based interface. SAS Studio also delivers prebuilt code snippets.
• In addition, users can use their favorite third-party interface (for example, Jupyter) to write and run R, Python, Java, or Lua code.
Regardless of which interface is used, the same CAS actions are applied behind the scenes for the same procedure. This provides important consistency.
SAS Visual Data Mining and Machine Learning in SAS Viya
Machine Learning Techniques (Visual Interface)
• Bayesian Networks
• Factorization Machine
• Forest
• Gradient Boosting
• Neural Network
• Support Vector Machine
Machine Learning Procedures (Indicative list) (SAS Studio Programmatic Interface)
• FACTMAC (Factorization Machine Model)
• FOREST (Forest Model)
• GRADBOOST (Gradient Boosting Model)
• NNET (Neural Network)
• SVMACHINE (Support Vector Machine)
• SVDD (Support Vector Data Description)
• BNET (Bayesian Network)
• BOOLRULE (Boolean Rules)
• FASTKNN (k-nearest neighbor)
• GVARCLUS (Variable Clustering and Graphical Modeling)
• MBANALYSIS (Association Rule Mining)
• RPCA (Robust Principal Component Analysis)
The slide above displays an indicative list of the data mining and machine learning procedures that are available in SAS Visual Data Mining and Machine Learning. These procedures provide data mining and machine learning algorithms that have been specially developed to take advantage of the distributed environment that the SAS Viya platform provides. Supervised learning methods that are available include forest, gradient boosting, neural networks, support vector machines, factorization machines, and Bayesian networks.
In addition to the data mining and machine learning procedures, SAS Visual Data Mining and Machine Learning provides procedures for sampling, data exploration, clustering, dimension reduction, and model assessment. Procedures for scoring via an analytic store and for text mining are also included. Several additional procedures are listed here for reference: ASTORE (Analytic Store), FISM (Frequent Item Set Mining), MTLEARN (Multitask Learning), SEMISUPLEARN (Semisupervised Learning), KPCA (Kernel Principal Component Analysis), GMM (Gaussian Mixture Model), and TEXTMINE (Text Mining).
Interfaces to SAS Viya
Although SAS Viya can be used by various SAS applications, it also enables you to access analytic methods from SAS, Python, Lua, and Java, as well as through a REST interface that uses HTTP or HTTPS.
SAS Viya greatly increases the openness of the SAS Platform. To access the SAS In-Memory Analytics engine, you have several options:
• Use web-based visual interfaces.
• Use programming interfaces such as SAS Studio.
• Call CAS actions from other languages such as R, Python, Lua, and Java.
• Call SAS APIs directly by using available REST APIs.
SAS Viya enables you to use any of the interfaces to run your code.
A Mindset Shift
• Distributed computing environment
• Randomness in algorithms
• Convergence of model parameters
In SAS Viya, you might get nondeterministic results, or results that are not exactly reproducible, essentially for two reasons:
• distributed computing environment
• nondeterministic algorithms
In distributed computing, cases are divided over compute nodes, and there could be variation in the results. You might get slightly different results even on the same server when the controllers or workers are more manageable. On different servers, this is even more expected.
The CAS server represents pooled memory and runs code multi-threaded. Multi-threading tends to distribute the same instructions to other available threads for execution, creating many different queues on many different cores that use separate allocations or subsets of data. Most of the time, multiple threads perform operations on isolated collections of data that are independent of one another but part of a larger table. For that reason, a counter (for example, n+1) operating on one thread can produce a result that is different from a counter operating on another thread, because each thread is working on a different subset of the data. Therefore, results can differ from thread to thread unless and until the individual results from multiple threads are summed together. It is not as complicated as it might sound, because SAS Viya automatically takes care of most collation and reassembly of processing results, with a few minor exceptions where you must further specify how to combine results from multiple threads.
A nondeterministic algorithm is an algorithm that, even for the same input, can exhibit different behaviors on different runs, as opposed to a deterministic algorithm. There are several ways an algorithm might behave differently from run to run. A concurrent algorithm can perform differently on different runs due to a race condition. A probabilistic algorithm's behavior depends on a random number generator. Nondeterministic algorithms are often used to find an approximation to a solution when the exact solution would be too costly to obtain using a deterministic one (Wikipedia).
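The idea of per-thread counters that only become meaningful once they are combined can be sketched in Python. This is an illustration under assumptions, not a description of how CAS is implemented: the table, the number of workers, and the helper names split and count_responders are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table of 1,000 rows with a binary target (1 = responder).
table = [1 if i % 5 == 0 else 0 for i in range(1000)]


def split(rows, n_parts):
    """Divide the rows into n_parts contiguous subsets, one per worker."""
    size = -(-len(rows) // n_parts)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def count_responders(subset):
    """A counter that sees only its own subset of the table."""
    return sum(subset)


parts = split(table, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns the partial results in the same order as the subsets.
    partial_counts = list(pool.map(count_responders, parts))

# Each partial count reflects one subset only; the table-level answer
# appears when the individual results are summed together.
total = sum(partial_counts)
```

Each worker reports a count for its own subset, and none of those counts equals the table-level count; only the combined total does.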
Some SAS Visual Data Mining and Machine Learning models are created with a nondeterministic process. This means that you might see different displayed results when you run a model, save that model, close the model, and then reopen or print the report later. It is an altogether different mindset! You are converging on a model or estimating a model, not exactly computing the parameters of the model. Bayesians understand this when they look for convergence of parameters. They try to converge to a distribution, not a point. It might be interesting to run the models 10 times across different samples and assemble the results to see the dominant signal. You cannot expect the results to be reproduced exactly, because some algorithms have randomness in them. However, the results have converged. This is a distributed computing environment aimed at big data, and this is the price of success.
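A small Python sketch can make the "run it 10 times and look for the dominant signal" idea concrete. It is only an analogy: a random-sample estimate of a mean stands in for a randomized modeling algorithm, and the population, sample size, and seeds are assumptions chosen for illustration.

```python
import random
import statistics

# Hypothetical population; its true mean is 500.5.
population = list(range(1, 1001))


def estimate_mean(seed, sample_size=100):
    """Estimate the mean from a random sample drawn with the given seed."""
    rng = random.Random(seed)
    return statistics.mean(rng.sample(population, sample_size))


# Different seeds stand in for different runs of a randomized algorithm.
estimates = [estimate_mean(seed) for seed in range(10)]

# Assembling the runs gives a steadier view of the dominant signal.
combined = statistics.mean(estimates)

# Fixing the seed is what makes a single run repeatable.
repeatable = estimate_mean(42) == estimate_mean(42)
```

The individual estimates vary from run to run, yet they scatter around the same underlying value, which is the sense in which results can be converged without being identical.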
1.4 Solutions
Solutions to Practices
1. Using SAS Visual Analytics
a. Sign in to SAS Visual Analytics. Enter student in the User ID field and Metadata0 in the Password field.
Note: Depending on the computing environment that you are using, the credentials might be different.
b. Click the Applications menu in the upper left corner of the SAS Drive page. Select Explore and Visualize.
c. Select the PVA_PARTITION data source.
d. Select the Data pane on the left of the canvas (if it is not open).
1) Drag the Status Category 96NK variable onto the canvas to create the bar chart. Examining the bar chart clearly shows that level A of the variable has the highest frequency count.
2) Click (Actions) and select View measure details. This takes you to the Measure Details window, which shows descriptive statistics for numeric data items. View the lower half of the Measure Details table to see the number of missing observations for the chosen variable. Age contains 26,477 missing observations.
3) What is the average of Target Gift Amount? 15.62
e. Change Target Gift Flag from a measure to a category. It is a binary indicator that represents a response to a mailing, where 1 indicates that customers did respond.
1) Click + next to the page 1 tab to create a new page. Drag the variable Target Gift Flag onto the canvas. The bar chart that is created shows an equal distribution of responders and non-responders.
2) Select this bar chart, go to Data Roles on the right side, and add Gender under the Group role. Put your cursor on the appropriate bar to determine the number of females who responded to the campaign: 28,699.
f. Save your report. Click (Menu) and select Save As. Save the report in My Folder > My Tasks with the name Exercise 1. Click Save.
Solutions to Activities and Questions
1.01 Multiple Choice Question – Correct Answer
A category variable can be created from a measure variable by editing which of the following property types?
a. aggregation
b. format
c. classification
d. This cannot be done.
Lesson 2 Machine Learning Algorithms
2.1 Introduction
Demonstration: Partitioning Data
2.2 Neural Networks
Demonstration: Training and Exploring a Neural Network Model in SAS Visual Data Mining and Machine Learning
Practice
2.3 Support Vector Machines
Demonstration: Training and Exploring an SVM Model in SAS Visual Data Mining and Machine Learning
Practice
2.4 Forests
Demonstration: Training and Exploring a Forest Model in SAS Visual Data Mining and Machine Learning
Practice
2.5 Gradient Boosting
Demonstration: Training and Exploring a Gradient Boosting Model in SAS Visual Data Mining and Machine Learning
Practice
2.6 Bayesian Networks
Demonstration: Training and Exploring a Bayesian Network Classifier in SAS Visual Data Mining and Machine Learning
2.7 Solutions
Solutions to Practices
Solutions to Activities and Questions
2.1 Introduction
Objectives
• Discuss machine learning and its applications.
• Discuss data partitioning and honest assessment.
Machine Learning
Machine learning is a branch of artificial intelligence that automates the building of systems that learn iteratively from data, identify patterns, and predict future results, with minimal human intervention. It shares many approaches with other related fields, but it focuses on predictive accuracy rather than interpretability of the model.
Machine learning is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. Machine learning is not new, but growing volumes and varieties of available data, computational processing that is cheaper and more powerful, and affordable data storage have made it more popular than ever.
Machine Learning: Multidisciplinary Nature
[Figure: diagram relating machine learning to statistics, data mining, natural language processing, artificial intelligence, deep learning, computer vision, and neurocomputing.]
Machine learning today is a mathematically rigorous discipline that encompasses sophisticated modeling, optimization, and learning research. It has concrete applications in medicine, software, robotics, and traditional business problems. Particularly in the business problem domain, there is significant overlap among the fields of data science, data mining, and machine learning. In contrast to many statistical modeling approaches, which can value inference over prediction, the focus of machine learning is predictive accuracy (Breiman 2001a). High predictive accuracy is usually achieved by training complex models, which often involves advanced numerical optimization routines, on a large number of training examples. Because of their almost uninterpretable internal mechanisms, some machine learning algorithms have been labeled “black box” techniques. Yet algorithms such as neural networks, random forests, support vector machines, and gradient boosting can learn faint and nonlinear patterns from training data that generalize well in test data.
Model Complexity
[Figure: fitted curves illustrating model complexity. Recoverable panel labels: Just right, Too flexible.]
Fitting a model to data requires searching through the space of possible models. Constructing a model with good generalization requires choosing the right complexity. For regression, including more terms in the model increases complexity. Selecting model complexity involves a tradeoff between bias and variance. An insufficiently complex model might not be flexible enough. This leads to underfitting, that is, systematically missing the signal (the true relationships). Underfitting produces biased inferences, which are inferences that do not match the true relationships in the population. A naive modeler might assume that the most complex model should always outperform the others, but this is not the case. An overly complex model might be too flexible. This leads to overfitting, that is, accommodating nuances of the random noise (chance relationships) in the particular sample. Overfitting produces models that have higher variance when applied to a population. A model with just enough flexibility gives the best generalization.
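The point that a more complex model always looks better on the data used to fit it can be illustrated with a small sketch. The data, the noise level, and the polynomial degrees below are hypothetical choices made only for illustration; they are not taken from the course data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: a smooth signal plus random noise.
x = np.linspace(-1.0, 1.0, 30)
y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=x.size)

def training_mse(degree):
    """Fit a polynomial of the given degree; return its error on the data it was fit to."""
    coefs = np.polynomial.polynomial.polyfit(x, y, degree)
    fitted = np.polynomial.polynomial.polyval(x, coefs)
    return float(np.mean((y - fitted) ** 2))

# More terms = more complexity. Each model contains the previous one as a
# special case, so the training error cannot increase with the degree.
errors = {degree: training_mse(degree) for degree in (1, 3, 9)}
for degree, mse in errors.items():
    print(f"degree {degree}: training MSE = {mse:.4f}")
```

Because the training error keeps falling as terms are added, it cannot by itself reveal when a model has started to fit noise. Complexity has to be judged on data that was not used for fitting.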
Data Partitioning and Honest Assessment
• Partition available data into training and validation sets.
• Train a series of models on the training data set, and evaluate model performance on the validation data set.
[Figure: the training data and the validation data each consist of inputs and a target.]
The strategy for choosing model complexity is to use honest assessment. With honest assessment, you select the model that performs best on a validation data set, which is not used to fit the model. Assessing performance on the same data set that was used to develop the model leads to selecting too complex a model (overfitting).
Note: The classic example of this is selecting linear regression models based on R-square.
In predictive modeling, the standard strategy for honest assessment of model performance is data splitting. A portion is used for fitting the model. That portion is the training data set. The remaining data are separated for empirical validation. The validation data set is used for monitoring and tuning the model to improve its generalization. The tuning process usually involves selecting among models of different types and complexities. The tuning process optimizes the selected model on the validation data.
Note: Because the validation data are used to select from a set of related models, reported performance will be overstated, on average. Consequently, a further holdout sample is needed for a final, unbiased assessment.
The test data set has only one use, which is to give a final honest estimate of generalization. Cases in the test set must be treated in the same way that new data would be treated. The cases cannot be involved in any way in the determination of the fitted prediction model.
In practice, many analysts see no need for a final honest assessment of generalization. An optimal model is chosen using the validation data, and the model assessment measured on the validation data is reported as an upper bound on the performance expected when the model is deployed.
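The workflow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the data are simulated, the candidate models are polynomials of degree 1 through 10, and the 50/25/25 split is an arbitrary choice. It is not how SAS Visual Data Mining and Machine Learning implements partitioning.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: one input and one interval target.
n = 600
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)

# Shuffle the row indices, then split 50% / 25% / 25%.
order = rng.permutation(n)
train, valid, test = order[:300], order[300:450], order[450:]

def fit(degree):
    """Fit a candidate model using the training rows only."""
    return np.polynomial.polynomial.polyfit(x[train], y[train], degree)

def ase(coefs, rows):
    """Average squared error of a fitted model on the given rows."""
    pred = np.polynomial.polynomial.polyval(x[rows], coefs)
    return float(np.mean((y[rows] - pred) ** 2))

# Train a series of models, then pick the one that does best on validation data.
candidates = {degree: fit(degree) for degree in range(1, 11)}
valid_ase = {degree: ase(coefs, valid) for degree, coefs in candidates.items()}
best_degree = min(valid_ase, key=valid_ase.get)

# The test rows are touched exactly once, for the final estimate.
test_ase = ase(candidates[best_degree], test)
print(f"selected degree: {best_degree}")
print(f"validation ASE:  {valid_ase[best_degree]:.4f}")
print(f"test ASE:        {test_ase:.4f}")
```

The validation ASE of the selected model was the minimum over ten candidates, so it tends to be optimistic. The test ASE is computed on rows that played no part in fitting or selection.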
Partitioning Data
In this demonstration, you create a partition variable using Data pane options. This partition variable can be used to perform model validation for various machine learning algorithms.
1. Open the previous report, Model Fitting. In the upper right corner, click (Menu), select Open, and navigate to My Folder > My Tasks > Model Fitting.
2. Click Open.
3. In the Data pane, click New data item and select Partition.
4. In the New Partition window, enter Part_Ind in the Name field and 50 in the Training partition sampling percentage field. Your New Partition window properties should appear as follows:
The Number of partitions option indirectly specifies whether a test partition is included. When you select 3 for this option, a training partition, validation partition, and testing partition are created. When you select 2, only a training partition and validation partition are created. At least 1% of the data must be used for the validation data. Therefore, the sum of the training partition and testing partition must be strictly less than 100.
Note: When using a generated partition variable, results are nondeterministic. If you close and reopen a report with a generated partition variable, subsequent results might not match the initial results. When you duplicate or change an object to a cluster or factorization machine object, the partition variable is dropped. However, you can create a partition variable for the cluster or factorization machine object. Even when you select the Random number seed option and specify a Random seed, you might still get nondeterministic results. This is due to differences in data distribution and computational threads or to the walker used to sample the partition column.
5. Click OK. The Partition variable is added to the Category list.
6. In the upper left corner of the report, click (New page) next to Page 2.
7. In the left pane, click the Data icon. Drag Part_Ind to the canvas.
The variable Part_Ind contains the values 0 and 1, one for each partition created. The bar chart shows that the two partitions have nearly equal frequency.
8. Save your report. Click (Menu) and select Save.
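Conceptually, a two-level partition variable such as Part_Ind is a random 0/1 flag assigned to each row. The sketch below shows one way to generate such a flag with a 50% training percentage. The function name make_partition and the coding (1 for training, 0 for validation) are assumptions made for illustration; the notes do not state which value SAS Visual Analytics assigns to which partition, and this is not its sampling algorithm.

```python
import numpy as np

def make_partition(n_rows, training_pct, seed=None):
    """Return a 0/1 flag per row; 1 marks a training row (assumed coding)."""
    rng = np.random.default_rng(seed)
    draws = rng.uniform(0.0, 100.0, size=n_rows)
    return (draws < training_pct).astype(int)

part_ind = make_partition(n_rows=10_000, training_pct=50, seed=42)

# Frequency of each partition value, as in the bar chart of Part_Ind.
values, counts = np.unique(part_ind, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

In this single-process sketch, a fixed seed reproduces the same flags on every run. As the note in step 4 explains, that is not guaranteed in the distributed environment, where data distribution and threading can change the result even when a seed is specified.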
2.2 Neural Networks
Objectives
• Describe neural network basics.
• Discuss options related to neural networks in SAS Visual Data Mining and Machine Learning.
• Train and explore a neural network model.
Linear Regression Models as Neural Networks
[Figure: the inputs X1, X2, ..., Xn, weighted by β1, β2, ..., βn, feed a single neuron together with the intercept β0. The neuron computes the output Y = f(β0 + β1X1 + β2X2 + ... + βnXn).]
The basic unit of computation in a neural network is the neuron, often called a node. It receives input from some other nodes and computes an output. The slide above depicts a neural network with only one neuron. You can think of a linear regression as a neural network. The central processing unit does two things: it computes the sum of the weighted inputs (parameters times X’s) plus the intercept term, and then this weighted sum is put into a transformation function (in this case, the Identity function). The output of this function is the output of the processing unit.
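The two steps performed by the processing unit can be written out directly. In this minimal sketch, the function name neuron and the coefficient values are hypothetical; with the identity function as the transformation, the neuron's output is exactly the linear regression prediction.

```python
import numpy as np

def neuron(x, weights, bias, activation=lambda z: z):
    """Step 1: weighted sum of the inputs plus the bias (intercept).
    Step 2: pass that sum through the transformation function (identity by default)."""
    return activation(np.dot(x, weights) + bias)

# Hypothetical coefficients: beta0 = 1.0, beta1 = 2.0, beta2 = -0.5.
beta0 = 1.0
betas = np.array([2.0, -0.5])
x = np.array([3.0, 4.0])

# Identity activation: 1.0 + 2.0 * 3.0 + (-0.5) * 4.0
y_hat = neuron(x, betas, beta0)
print(y_hat)
```

Passing a different function as activation (for example, np.tanh) turns the same unit into the kind of hidden unit discussed in the following slides.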
Neural Network Prediction Formula
[Figure: the prediction formula annotated with its parts (prediction estimate, bias estimate, weight estimate, hidden unit, activation function), next to a plot of the tanh function, which ranges from -1 to 1 over inputs from -5 to 5.]
Like regressions, neural networks predict cases using a mathematical equation involving the values of the input variables. A neural network prediction formula can be thought of as a regression of the response variable on a set of derived inputs, called hidden units. In turn, these hidden units can be thought of as regressions on the original inputs. The hidden unit "regressions" include a default link function (an activation function, in neural network language), the hyperbolic tangent. Because of a neural network's biological roots, its components receive different names from corresponding components of a regression model. Instead of an intercept term, a neural network has a bias term. Instead of parameter estimates, a neural network has weight estimates.
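The "regression on hidden units, which are themselves regressions on the inputs" view can be sketched as a short function. The network size (two inputs, three hidden units) follows the network diagram shown later in this section; every weight and bias value below is a hypothetical number chosen for illustration.

```python
import numpy as np

# Hypothetical weight and bias estimates for 2 inputs and 3 hidden units.
hidden_weights = np.array([[0.5, -1.0],
                           [1.5,  0.2],
                           [-0.7, 0.9]])   # one row per hidden unit
hidden_bias = np.array([0.1, -0.3, 0.0])
output_weights = np.array([2.0, -1.0, 0.5])
output_bias = 0.25

def predict(x):
    """Prediction for an interval target."""
    # Each hidden unit is a "regression" on the inputs, passed through tanh.
    hidden = np.tanh(hidden_weights @ x + hidden_bias)
    # The prediction is a regression on the hidden units.
    return float(output_weights @ hidden + output_bias)

print(predict(np.array([0.4, -1.2])))
```

Because tanh is bounded between -1 and 1, each hidden unit contributes at most the absolute value of its output weight, so the prediction stays within a fixed distance of the output bias no matter how extreme the inputs are.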
Neural Network Binary Prediction Formula
[Figure: the prediction formula for a binary target, with the logit link function applied to the output, next to plots of the logit link function and the tanh activation function.]
When the target variable is binary, the main neural network regression equation receives the logit link function.
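In practical terms, the combination of hidden units is modeled on the logit scale, and the inverse of the logit (the logistic function) converts it to a probability. The sketch below assumes hypothetical hidden unit values and weights; only the logit and logistic functions themselves are standard.

```python
import math

def logit(p):
    """Logit link: maps a probability in (0, 1) to the whole real line."""
    return math.log(p / (1.0 - p))

def logistic(eta):
    """Inverse of the logit link: maps any real number to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical hidden unit values (each is a tanh output) and output-layer estimates.
hidden = [0.2, -0.5, 0.9]
output_weights = [1.5, -2.0, 0.8]
output_bias = -0.4

# The network equation gives the logit of the event probability ...
eta = output_bias + sum(w * h for w, h in zip(output_weights, hidden))
# ... and the inverse link turns it into a probability between 0 and 1.
p_event = logistic(eta)
print(f"logit = {eta:.3f}, probability = {p_event:.3f}")
```

Applying logit to p_event recovers eta, which is what it means for the two functions to be inverses of each other.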
Neural Network Diagram
[Figure: network diagram with an input layer (x1, x2), a hidden layer (H1, H2, H3), and a target layer (y).]
Neural network models were originally inspired by neurophysiology and the interconnections between neurons, and they are often represented by a Network diagram instead of an equation. The basic model form arranges neurons in layers. The first layer, called the input layer, connects to a layer of neurons called a hidden layer, which in turn connects to a final layer called the target or output layer. Each element in the diagram has a counterpart in the network equation. The blocks in the diagram correspond to inputs, hidden units, and target variables. The block interconnections correspond to the network equation weights.
Advantages of Neural Networks
• Neural networks are universal approximators.
• The function describing the input-output relationship does not need to be specified.
• Neural networks are one of the fastest scoring models.
With a sufficient number of hidden units and enough time, a neural network can model any input-output relationship to any degree of precision. Although neural networks are parametric nonlinear regression models, they behave like nonparametric regression (smoothing splines), in that it is not necessary to specify the functional form of the model. This allows construction of models when the relationship between the inputs and outputs is unknown. In addition, after it is trained, a neural network is among the fastest executing predictive models. This means that a trained neural network can efficiently score large volumes of data.
Disadvantages of Neural Networks
• There is a lack of interpretability.
• Unique optimal values for weights are not guaranteed.
Autotuning in SAS Visual Data Mining and Machine Learning
Autotuning is the process of automatically and algorithmically adjusting model parameters to create a set of competing versions of one particular model.
Autotune options:
• Maximum seconds
• Maximum iterations
• Maximum evaluations
To create a good predictive model, you must decide which model and parameters to use. You can try a trial-and-error approach, or you can rely on experience and personal preference. However, neither of these guarantees that you will find the best model for your data. SAS Visual Data Mining and Machine Learning can automate the process of determining optimum model parameters.
Every model available for SAS Visual Data Mining and Machine Learning in SAS Visual Analytics can be autotuned. The parameters that you can autotune are model dependent. However, for all models, you can limit the maximum amount of time, training iterations, and model evaluations performed during autotuning. This prevents server resources from being monopolized by a single user or task. Autotuning does run a slight risk of overfitting a model, especially when no partitioning is used.
When autotuning a model, you can specify the following options in the Autotune Hyperparameters window by clicking Autotune in the Options pane:
• Maximum seconds: The maximum amount of time that the model will run, in seconds. You must specify a positive integer value. The autotuning algorithm always runs for a minimum of 60 seconds, even if you specify a value smaller than 60.
• Maximum iterations: At each iteration, a set of models is created to be evaluated against each other. This property determines the number of sets of models that are created. You must specify a positive integer value.
• Maximum evaluations: The maximum number of different models that are created for evaluation. You must specify a positive integer value.
Autotuning selects optimal values for several parameters, which vary based on the selected machine learning algorithm.
Note: In the visual interface, you cannot control the autotune method.
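The way the three limits interact can be sketched with a simple budgeted search. This is an illustration only: it uses plain random search, whereas the notes do not describe the search method that SAS uses, and it does not model the 60-second minimum run time. The function names autotune, sample, and evaluate, the number of models per iteration, and the error surface are all hypothetical.

```python
import random
import time

def autotune(evaluate, sample, max_seconds, max_iterations, max_evaluations,
             models_per_iteration=5, seed=0):
    """Budgeted random search; returns (best_params, best_error, evaluations_used)."""
    rng = random.Random(seed)
    start = time.monotonic()
    best_params, best_error = None, float("inf")
    evaluations = 0
    for _ in range(max_iterations):              # one set of models per iteration
        for _ in range(models_per_iteration):
            out_of_time = time.monotonic() - start >= max_seconds
            if out_of_time or evaluations >= max_evaluations:
                return best_params, best_error, evaluations
            params = sample(rng)                 # propose a candidate model
            error = evaluate(params)             # e.g. its validation error
            evaluations += 1
            if error < best_error:
                best_params, best_error = params, error
    return best_params, best_error, evaluations

def sample(rng):
    """Hypothetical search space: number of neurons and L2 penalty."""
    return {"neurons": rng.randint(1, 100), "l2": rng.uniform(0.0, 1.0)}

def evaluate(params):
    """Hypothetical validation-error surface with its minimum at 40 neurons, L2 = 0.1."""
    return (params["neurons"] - 40) ** 2 / 1000.0 + (params["l2"] - 0.1) ** 2

best_params, best_error, used = autotune(
    evaluate, sample, max_seconds=5, max_iterations=4, max_evaluations=12)
print(best_params, round(best_error, 4), used)
```

Here four iterations of five models would allow 20 evaluations, but the search stops at 12 because the Maximum evaluations limit is reached first. Whichever limit is hit first ends the search.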
Neural Network: Data Roles
• Response – one measure or category variable
• Predictors – at least one measure or category variable
• Partition ID – only one partition variable
• Weight – only one measure
To create a neural network model, you must assign a single variable under the response role. However, any number of measure or category variables can be assigned under the predictors role. In addition, you can specify a partition ID and weight variable. If a partition variable is identified and the Partition ID data role is specified, then holdout validation is performed. Holdout validation builds the model using only the observations whose partition column value corresponds to the training data value. Next, the observations that correspond to the validation data value are fed through the model. The model's results are compared to the actual data set, and validation statistics are computed.
Note: If you do not have a partition variable, you can create one using SAS Visual Analytics functionality, as discussed in a previous section.
Adding roles to the model automatically updates the model. If you do not want the model to be updated automatically when you add roles, select the Report menu from the top right corner and select Interface options > Disable auto-refresh. After you define all the roles, you can update the model by selecting Interface options > Enable auto-refresh from the Report menu or by clicking (Refresh).
Neural Network: Options
General
• Event level
• Autotune
• Include missing
• Standardization
• Maximum iterations
• Maximum time
• Optimization method
• L1
• L2
The following options are available for neural networks:
General
• Event level – enables you to choose the event level of interest. When your category response variable contains more than two levels, SAS Visual Data Mining and Machine Learning treats all observations in the level of interest as an event and all other observations as nonevents.
• Autotune – enables you to specify the constraints that control the autotuning algorithm. The constraints determine how long the algorithm can run, how many times the algorithm can run, and how many model evaluations are allowed. When you are unsure what to use for model settings, autotuning can be used to find optimal values for model hyperparameters. The autotuning algorithm selects the Number of hidden layers, Number of neurons, L1, L2, Learning rate, Annealing rate, and Maximum iterations values that produce the best model. In addition, if data partitioning is applied, then the Auto-stop method, Auto-stop iterations, and Goal value parameters are also autotuned.
– Auto-stop method – specifies the method that controls early termination of the model-building algorithm. For Stagnation, the algorithm terminates when it sees no improvement in the validation error for the specified number of calculations. For Goal, the algorithm terminates when the validation error is less than the specified Goal value.
– Auto-stop iterations – specifies the number of successive validation error calculations required to trigger the Stagnation auto-stop method. It can take an integer value in the range 1 to 10000.
– Goal value – specifies the maximum validation error acceptable to trigger the Goal auto-stop method. When the validation error is less than this value, the model-building algorithm terminates.
• Include missing – specifies whether observations with missing values are included in the model. For category predictors, missing values are assigned their own measurement levels. For measure predictors, missing values are imputed with the measure variable's mean.
• Distribution – specifies the distribution used to model the response variable. This option is available only when the response is a measure variable. Available choices are Normal, Gamma, and Poisson.
• Output activation function – specifies the activation function that is used to create the output layer when using a normal distribution to model the response. With the normal distribution selected, the choices of activation function include Identity, Hyperbolic tangent, and Sine. The exponential activation function is used with a Gamma or Poisson distribution. This option is available only when you specify a measure response and the normal distribution.
• Standardization – specifies the method used to standardize the measure predictors. The possible values are as follows:
– None – no standardization is applied.
– Standard deviation – transforms predictors to have a mean of 0 and a standard deviation of 1.
– Midrange – transforms predictors to the range of 0 to 1.
• Maximum iterations – specifies the maximum number of optimization iterations that are used.
• Maximum time – specifies the maximum time-out value in seconds.
• Optimization method – specifies the optimization method used to train the neural network. You can specify either the SGD or LBFGS method.
• Learning rate – specifies the learning rate parameter used in the stochastic gradient descent (SGD) optimization method.
• Annealing rate – specifies the annealing rate parameter used in the stochastic gradient descent (SGD) optimization method.
• L1 – specifies the L1 regularization parameter. It penalizes the absolute value of the weights. The weights shrink by a constant amount toward zero.
• L2 – specifies the L2 regularization parameter. The weights shrink by an amount that is proportional to the weights.
Note: More details about regularization techniques and learning rate are given in Appendix A.
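The difference between "shrink by a constant amount" and "shrink by a proportional amount" is easy to see numerically. The sketch below applies only the penalty part of a single weight update, with hypothetical weights, learning rate, and penalty values. It assumes an L1 penalty of l1 times the sum of absolute weights and an L2 penalty of (l2 / 2) times the sum of squared weights, and it clips L1-shrunk weights at zero so that they do not overshoot. These are common conventions, not a description of the SAS optimizer.

```python
import numpy as np

def l1_shrink(weights, learning_rate, l1):
    """Move every weight toward zero by the same constant step, stopping at zero."""
    step = learning_rate * l1
    return np.sign(weights) * np.maximum(np.abs(weights) - step, 0.0)

def l2_shrink(weights, learning_rate, l2):
    """Scale every weight toward zero by the same proportion."""
    return weights * (1.0 - learning_rate * l2)

w = np.array([2.0, -0.5, 0.05])   # hypothetical weights

print(l1_shrink(w, learning_rate=0.1, l1=1.0))   # each |weight| drops by 0.1
print(l2_shrink(w, learning_rate=0.1, l2=1.0))   # each weight drops by 10%
```

Under L1, the small weight 0.05 is driven all the way to zero, which is why L1 tends to remove weights entirely. Under L2, large weights are reduced the most in absolute terms, but no weight becomes exactly zero.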
Neural Network: Options
Hidden Layers
• Number of hidden layers
• Allow direct connections between input and target neurons
• Neurons
• Activation functions
Assessment
• Number of bins
• Prediction cutoff
• Statistic percentile
• Tolerance
Hidden Layers
• Number of hidden layers – specifies the number of hidden layers in the model. The maximum value allowed is 2 if the Optimization method is LBFGS. The maximum value is 5 if the Optimization method is SGD.
• Allow direct connections between input and target neurons – specifies that each input neuron is connected to each output neuron in the network.
• Neurons – specifies the number of neurons in the hidden layer.
• Activation function – specifies the activation function used for each neuron's output based on the weighted sum of its inputs. This can be changed for each hidden layer. You can specify a Hyperbolic Tangent, Identity, Sine, Exponential, Logistic, Rectifier, or Softplus function as the activation function.
Assessment
• Number of bins – specifies the number of bins to use in the lift calculations in the Assessment plot. The default is 20, but you can enter your own number of bins. Increasing the number of bins increases the accuracy of the assessment at the expense of computing time.
• Prediction cutoff – specifies the prediction cutoff value to determine whether an observation is a modeled event. The default is 0.5. Changing the default value affects the misclassification rate.
• Statistic percentile – specifies the depth for the percentile bins to calculate the observed average, lift, cumulative lift, cumulative percentage captured, cumulative percentage events, and gain.
• Tolerance – specifies the tolerance value that is used to determine the convergence of the iterative algorithm that estimates the percentiles. Specify a smaller value to increase the algorithmic precision.
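The effect of the Prediction cutoff on the misclassification rate can be shown with a handful of scored observations. The probabilities and outcomes below are made up for illustration, and the sketch assumes that an observation is classified as an event when its predicted probability is greater than or equal to the cutoff (the notes do not say whether the comparison is strict).

```python
import numpy as np

# Hypothetical predicted event probabilities and actual outcomes (1 = event).
p_event = np.array([0.92, 0.81, 0.67, 0.55, 0.48, 0.35, 0.30, 0.12])
actual = np.array([1, 1, 0, 1, 1, 0, 0, 0])

def misclassification_rate(cutoff):
    """Share of observations whose predicted class differs from the actual class."""
    predicted = (p_event >= cutoff).astype(int)
    return float(np.mean(predicted != actual))

for cutoff in (0.3, 0.5, 0.7):
    print(f"cutoff {cutoff}: misclassification rate = {misclassification_rate(cutoff)}")
```

Lowering the cutoff classifies more observations as events, which trades false negatives for false positives. The rate that results depends on where the cutoff falls relative to the predicted probabilities.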
Neural Network: Model Display Options
General
• Plot layout
• Statistics to show
Network Diagram
• Neuron labels
• Neuron layout
• Horizontal spacing
• Vertical spacing
• Percentage of links to display
• Horizontal layout
• Number of neurons to display
The model display properties control the output layout and the plots and statistics to show. The following display properties are available for neural networks:
General
• Plot layout – specifies how the results windows are displayed on the canvas. Fit aligns all of the objects on the canvas automatically. Stack displays the objects as if they are in a slide deck. Only one object is displayed at a time. When Stack is specified, a control bar lets you move between objects.
• Statistic to show – specifies which assessment statistic to display in the model.
Network Diagram
• Neuron labels – specifies whether the neurons in the Network diagram are labeled.
• Neuron layout – specifies how the neurons are plotted in the Network diagram.
• Horizontal spacing – specifies the amount of horizontal space that exists between nodes.
• Vertical spacing – specifies the amount of vertical space that exists between nodes.
• Percentage of links to display – specifies how many connecting links are displayed in the Network diagram.
• Horizontal layout – specifies whether the width of the nodes in the Network diagram is fixed or adjusts to the width of the Network diagram.
• Number of neurons to display – specifies the maximum number of neurons that are displayed in the Network diagram.
Neural Network: Model Display Options
Iteration Plot/Relative Importance Plot
• Plot to show
Assessment Plots
• Plot to show
• Y axis
Iteration Plot/Relative Importance Plot
• Plot to show – specifies which plot to show among the Iteration plot, Relative Importance plot, Partial Dependence plot, or Validation Error plot (when validation data is available).
Assessment Plots
• Plot to show – specifies which assessment plot is displayed. For a category response variable, you can select Confusion matrix, Lift, ROC, or Misclassification.
• Y axis – specifies which statistic is plotted in the Lift plot.
Neural Network: Results
[Figure: results canvas showing the summary bar, the Network plot, the Iteration plot, and the Assessment plot.]
By default, neural network results are displayed all together on one canvas. Change the Plot Layout option to Stack to place each result on its own canvas with tabbed access. After a neural network model has been created, there is a summary bar along the top of the three panels that displays the results. Each of these panels in the results window is discussed in the slides to follow.
Analyzing Neural Network Results
Three panes appear under a summary bar. They can help you analyze the results of the neural network model.
• Network plot – displays the input nodes, hidden nodes, connections, and output nodes of a neural network.
• Iteration plot – plots the value of the objective/loss function at each iteration in the network building process.
• Assessment plot – helps determine how well the model fits the data.
Neural Network Results: Summary Bar
For all models, general model information appears at the top of the canvas.
• model type
• name of response variable
• model evaluation criteria (selected from model toolbar or options)
• number of observations used to build the model
• create pipeline button
Note: The Create pipeline button enables the transfer of a model from SAS Visual Analytics to Model Studio.
You can click the currently displayed model evaluation criterion in the summary bar to select a different choice from the pop-up menu. The choices depend on the model type that is used. For a neural network model with a category response variable, the default statistic displayed is KS (Youden), whereas for a measure response variable, ASE (Average square error) is displayed.
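Both default statistics are straightforward to compute by hand, which can help when interpreting the summary bar. The sketch below uses the usual textbook definitions: KS (Youden) as the largest gap between the true positive rate and the false positive rate over all cutoffs, and ASE as the mean of the squared differences between actual and predicted values. The notes do not spell out the exact formulas that SAS uses, so treat these definitions, the function names, and the sample data as assumptions.

```python
import numpy as np

def ks_youden(p_event, actual):
    """Largest (true positive rate - false positive rate) over all candidate cutoffs."""
    p_event = np.asarray(p_event, dtype=float)
    actual = np.asarray(actual, dtype=int)
    best = 0.0
    for cutoff in np.unique(p_event):
        predicted = p_event >= cutoff
        tpr = np.mean(predicted[actual == 1])   # share of events flagged
        fpr = np.mean(predicted[actual == 0])   # share of nonevents flagged
        best = max(best, float(tpr - fpr))
    return best

def ase(predicted, actual):
    """Average squared error between predicted and actual values."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

# Hypothetical scored data for a category response (1 = event).
p_event = [0.92, 0.81, 0.67, 0.55, 0.48, 0.35, 0.30, 0.12]
actual = [1, 1, 0, 1, 1, 0, 0, 0]
print("KS (Youden):", ks_youden(p_event, actual))

# Hypothetical predictions for a measure response.
print("ASE:", ase([1.0, 2.0], [1.0, 4.0]))
```

A KS (Youden) value of 1 means that some cutoff separates events from nonevents perfectly, and a value near 0 means that the model separates them no better than chance. For ASE, smaller values indicate a better fit.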
Neural Network Results: Network Plot
• The Network plot window depicts the input nodes, connections, and output nodes.
• The size of the circle represents the absolute value of estimated weights.
• Color indicates whether the weights are positive or negative.
• The width of the line between two nodes indicates the magnitude of strength and the color indicates the sign.
2.2 Neural Networks
The Network diagram displays the input nodes, hidden nodes, connections, and output nodes of a neural network. Nodes are represented as circles, and links between the nodes are lines connecting two circles. The size of the circle represents the absolute value at that node, relative to the model, and the color indicates whether that value is positive or negative. Similarly, the width of the line between two nodes indicates the strength of the link, and the color indicates whether that value is positive or negative.
To modify the neural network, right-click the Network diagram and select one of the following options:
• Add a hidden layer – inserts a new hidden layer into the neural network and rebuilds the model.
• Edit a hidden layer – specifies the hidden layer that you want to modify. In the Edit Hidden Layer window, specify the number of neurons and the activation function for this hidden layer.
• Remove a hidden layer – removes a hidden layer from the neural network and rebuilds the model.
Neural Network Results: Iteration Plot
Three plots are available to show in this window pane:
• Iteration plot
• Relative importance plot
• Partial dependence plot
The neural network model display property enables you to select one of three plots: Iteration plot, Relative Importance plot, or Partial Dependence plot. By default, an Iteration plot is shown. However, you can switch to a Relative Importance plot or Partial Dependence plot either by changing the setting in the Options pane or by right-clicking the Iteration Plot window and selecting from the list of choices.
Neural Network Results: Iteration Plot
The Iteration Plot window plots the value of the objective/loss function at each iteration in the network building process.
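To make the idea of an iteration plot concrete, the following minimal Python sketch records the loss at each iteration while fitting a single-neuron logistic model by gradient descent. It is an illustration only and not the optimizer that the software uses; the function name train_logistic, the toy data, and the learning rate are all assumptions.

```python
import math

def train_logistic(xs, ys, learning_rate=0.5, iterations=20):
    """Fit y ~ sigmoid(w*x + b) by gradient descent; record the loss per iteration."""
    w, b = 0.0, 0.0
    history = []
    n = len(xs)
    for _ in range(iterations):
        preds = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in xs]
        # Average cross-entropy loss at the current weights.
        loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                    for y, p in zip(ys, preds)) / n
        history.append(loss)
        # Gradient of the loss with respect to w and b.
        grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(p - y for p, y in zip(preds, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history

# Hypothetical toy data: one input, binary response.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
w, b, loss_history = train_logistic(xs, ys)
for iteration, loss in enumerate(loss_history):
    print(iteration, round(loss, 4))
```

Plotting loss_history against the iteration number gives the same kind of curve as the Iteration plot: a value that falls quickly at first and then levels off as training converges.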
Neural Network Results: Relative Importance Plot
The Relative Importance Plot window shows the importance value of each input variable.
The Relative Importance plot displays the importance value of each input variable. The variables are ranked using their first-split log worth when applied to the scored training data. This plot can be empty if no variables are determined to be important.
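Log worth is the negative base-10 logarithm of a p-value, so larger values indicate a stronger split. The sketch below computes it for a binary split on a binary response by using a Pearson chi-square test with one degree of freedom. The choice of test and the function name logworth_2x2 are assumptions made for illustration; the notes do not state which test the software applies.

```python
import math

def logworth_2x2(a, b, c, d):
    """Log worth of a binary split, from the 2x2 table [[a, b], [c, d]].

    Rows are the two branches of the split; columns are event and non-event counts.
    """
    n = a + b + c + d
    # Pearson chi-square statistic for a 2x2 table.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail p-value of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return -math.log10(p_value)

# Hypothetical counts: a split that separates events well and one that does not.
strong = logworth_2x2(30, 10, 10, 30)
weak = logworth_2x2(22, 18, 18, 22)
print(round(strong, 2), round(weak, 2))
```

An input whose best first split has the larger log worth would be ranked higher in a plot of this kind.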
Neural Network Results: Partial Dependence Plot
The Partial dependence (PD) plot depicts how the value of a given predictor affects the model’s predictions.
The Partial Dependence (PD) plots show how the model’s predictions partially depend on values of the input variables of interest. The simplest PD plots are one-way plots, which show how a model’s predictions depend on a single input to the model. You can create PD plots for model inputs of both measure and category variables. The Y axis of this plot displays the predicted probability of the event (for categorical response models) or the predicted response value (for measure response models), whereas the X axis displays the values of the specified predictor. The predictor with the largest importance value is shown by default on the X axis.
To create a one-way PD plot, first find the unique values of the plot variable in the training data set and identify the complementary variables (the remaining inputs). Then create a replicate of the training data for each unique value of the plot variable, keeping the values of the complementary variables the same as in the training data. Next, score each replicate by using the predictive model and compute the average predicted value within each replicate. The resulting plot displays how the model’s prediction changes with respect to the plot variable.
The adjoining plot shows the point estimate and a 95% confidence band. The heat map at the bottom of the plot indicates the density of the observations that were sampled. Some bins might have zero observations. The Partial Dependence plots are generated by using a sample of all observations, even if you are using a partition variable.
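The replicate-score-average procedure for a one-way PD plot can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the scoring function toy_model, the variable names purchases and tenure, and the three-row data set are all hypothetical, and no sampling or confidence band is computed.

```python
def partial_dependence(rows, plot_var, predict):
    """One-way partial dependence: average prediction per unique value of plot_var."""
    unique_values = sorted({row[plot_var] for row in rows})
    curve = {}
    for value in unique_values:
        # Replicate the data with plot_var fixed at this value;
        # the complementary variables keep their original values.
        replicate = [{**row, plot_var: value} for row in rows]
        scores = [predict(row) for row in replicate]
        curve[value] = sum(scores) / len(scores)
    return curve

def toy_model(row):
    # Hypothetical scoring function standing in for a fitted model.
    return 0.1 + 0.2 * row["purchases"] + 0.05 * row["tenure"]

training = [
    {"purchases": 0, "tenure": 1},
    {"purchases": 1, "tenure": 3},
    {"purchases": 2, "tenure": 2},
]
pd_curve = partial_dependence(training, "purchases", toy_model)
for value, average in pd_curve.items():
    print(value, round(average, 3))
```

Each key of pd_curve is a point on the X axis and each value is the corresponding height of the PD curve.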
Neural Network Results: Assessment Plots
Four plots are available to show in this window pane:
• The Assessment window defaults to a confusion matrix chart for a category response. A ROC chart, lift chart, and misclassification plot are also available.
The Assessment window enables you to choose to display a confusion matrix chart, lift chart, ROC chart, or misclassification plot. To switch to another plot, right-click in the center of the Assessment window and select the desired plot.
Note: For a measure response, the assessment plot displays the predicted average versus the observed average response values.
Neural Network Results: Confusion Matrix
A confusion matrix is a performance measurement technique for categorical response models. It is a simple table that summarizes the performance of the classification model on test data for which the true values are known. It shows whether the classification model is confusing two classes (that is, mislabeling one as another).
After a model is created, each observation has an observed value and a predicted value. The total number of each observed-predicted pair is calculated, which results in four types of counts: true positives, true negatives, false positives, and false negatives. A true positive is a case that is predicted as an event (1) when it is a known event (1). A true negative is a case that is predicted as a non-event (0) when it is a known non-event (0). A false positive is a case that is predicted as an event (1) when it is actually a non-event (0), and a false negative is a case that is predicted as a non-event (0) when it is actually an event (1).
In the previous slide, the diagonal values represent the number of cases for which the predicted class is equal to the observed class (that is, true negatives and true positives), while off-diagonal values are those that are mislabeled by the classifier (that is, false positives and false negatives). The higher the diagonal values of the confusion matrix the better, indicating many correct predictions. Any off-diagonal values represent a misclassification. Cells are shaded based on the proportion of the value in each cell to the number of observed values for that level. Darker shaded cells show the concentration of the predictions for the observed level. For a binary response, the information in the confusion matrix is identical to the Misclassification plot.
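The four counts can be tallied directly from paired observed and predicted values. The following Python sketch assumes a binary response coded 1 (event) and 0 (non-event); the function name confusion_counts and the eight-case data set are hypothetical.

```python
def confusion_counts(observed, predicted):
    """Count true/false positives and negatives for a response coded 1 (event) or 0."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for obs, pred in zip(observed, predicted):
        if pred == 1 and obs == 1:
            counts["TP"] += 1   # predicted event, known event
        elif pred == 0 and obs == 0:
            counts["TN"] += 1   # predicted non-event, known non-event
        elif pred == 1 and obs == 0:
            counts["FP"] += 1   # predicted event, actually non-event
        else:
            counts["FN"] += 1   # predicted non-event, actually event
    return counts

# Hypothetical observed and predicted classes for eight cases.
observed = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(observed, predicted)
misclassification_rate = (counts["FP"] + counts["FN"]) / len(observed)
print(counts, misclassification_rate)
```

The diagonal of the confusion matrix holds TN and TP; the off-diagonal cells hold FP and FN, and their share of all cases is the misclassification rate.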
Neural Network Results: Lift Chart
Technically, lift is the ratio of the percent of captured responses within each percentile bin to the average percent of responses for the model. Similarly, cumulative lift is calculated by using all of the data up to and including the current percentile bin. In simple words, a lift chart is a graphical representation of the advantage (or lift) of using a predictive model to improve upon target response versus not using a model. The default lift chart displays the cumulative lift of the model. The higher the lift in the lower percentiles of the chart, the better the model is. The chart shows two lines, one that represents the model that you have built and one that represents the best model achievable (or a perfect classifier). The closer the Model line is to the Best line, especially in the lower percentiles, the better the model. Right-click in the center of the plot to switch between a cumulative lift chart and a lift chart.
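As a rough illustration of cumulative lift, the Python sketch below ranks cases by descending score and compares the response rate in the top fraction with the overall response rate. The function name cumulative_lift, the rounding of the depth to a whole number of cases, and the ten-case data set are assumptions; the software bins the data by percentile, which this sketch does not reproduce exactly.

```python
def cumulative_lift(observed, scores, depth):
    """Cumulative lift at a depth (fraction of cases), ranking by descending score."""
    ranked = [obs for _, obs in sorted(zip(scores, observed), key=lambda pair: -pair[0])]
    top_n = max(1, round(depth * len(ranked)))
    top_rate = sum(ranked[:top_n]) / top_n        # response rate in the top cases
    overall_rate = sum(ranked) / len(ranked)      # average response rate
    return top_rate / overall_rate

# Hypothetical data: 10 cases, 3 events, scores from some fitted model.
observed = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

model_lift = cumulative_lift(observed, scores, 0.2)
# A perfect ordering ranks every event first; scoring by the observed value does that.
best_lift = cumulative_lift(observed, observed, 0.2)
full_lift = cumulative_lift(observed, scores, 1.0)
print(round(model_lift, 3), round(best_lift, 3), full_lift)
```

At full depth the cumulative lift is always 1, which is why both the Model line and the Best line converge to 1 at the right edge of the chart.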
Neural Network Results: ROC
A receiver operating characteristic (ROC) chart displays the ability of a model to avoid false positive and false negative classifications. A false positive classification means that an observation has been identified as an event when it is actually a nonevent. (This is also referred to as a Type I error.) A false negative classification means that an observation has been identified as a nonevent when it is actually an event. (This is also referred to as a Type II error.)
The specificity of a model is the true negative rate. To derive the false positive rate, subtract the specificity from 1. The false positive rate, labeled 1 – Specificity, is the X axis of the ROC chart. The sensitivity of a model is the true positive rate. This is the Y axis of the ROC chart. Therefore, the ROC chart plots how the true positive rate changes as the false positive rate changes.
The classification accuracy of a model is demonstrated by the degree that the ROC curve pushes upward and to the left. This degree can be quantified by the area under the curve. The area will range from 50, for a worthless model, to 100, for a perfect classifier. For a perfect model, one with no false positives and no false negatives, the ROC chart would start at (0, 0), continue vertically to (0, 1), and then horizontally to (1, 1). In this instance, the model would correctly classify every observation before a single misclassification could occur.
The dotted red vertical line indicates the Kolmogorov-Smirnov (Youden) statistic, labeled KS (Youden), a goodness-of-fit statistic that represents the maximum distance between the model ROC curve and the baseline ROC curve. To see the actual number, position your pointer over the top or bottom of the red line.
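The ROC coordinates and the maximum separation from the baseline can be sketched in Python as follows. The sketch assumes that a case is classified as an event when its score is at or above the cutoff, and it measures separation as sensitivity minus (1 – specificity), which is the vertical gap to the 45-degree baseline. The function names, the cutoff grid, and the data are hypothetical.

```python
def roc_points(observed, scores, cutoffs):
    """Return (cutoff, 1 - specificity, sensitivity) for each cutoff."""
    events = sum(observed)
    nonevents = len(observed) - events
    points = []
    for cutoff in cutoffs:
        tp = sum(1 for o, s in zip(observed, scores) if s >= cutoff and o == 1)
        fp = sum(1 for o, s in zip(observed, scores) if s >= cutoff and o == 0)
        points.append((cutoff, fp / nonevents, tp / events))
    return points

def max_separation(points):
    """Point with the largest vertical gap between the ROC curve and the baseline."""
    return max(points, key=lambda p: p[2] - p[1])

# Hypothetical observed classes and model scores for eight cases.
observed = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
points = roc_points(observed, scores, [0.25, 0.5, 0.75])
best_cutoff, fpr, tpr = max_separation(points)
ks_youden = tpr - fpr
print(points)
print(best_cutoff, ks_youden)
```

The cutoff at which the gap is largest plays the role of the red vertical line in the chart.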
Neural Network Results: Misclassification
The misclassification plot displays how many observations were correctly and incorrectly classified for each value of the response variable. When the response variable is not binary, the neural network model considers all levels that are not events as equal. Any change in the prediction cutoff updates the misclassification chart. A significant number of misclassifications might indicate that the model does not fit the data well.
Note: A misclassification plot is sensitive to the cutoff selected.
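The sensitivity to the cutoff can be seen in a small Python sketch that tallies correct and incorrect classifications for each observed level. It assumes that a case is predicted as an event when its predicted probability is at or above the cutoff; the function name classification_counts and the data are hypothetical.

```python
def classification_counts(observed, probabilities, cutoff=0.5):
    """Count correct and incorrect classifications per observed level at a cutoff."""
    counts = {1: {"correct": 0, "incorrect": 0}, 0: {"correct": 0, "incorrect": 0}}
    for obs, prob in zip(observed, probabilities):
        predicted = 1 if prob >= cutoff else 0
        key = "correct" if predicted == obs else "incorrect"
        counts[obs][key] += 1
    return counts

# Hypothetical observed classes and predicted event probabilities.
observed = [1, 1, 1, 1, 0, 0, 0, 0]
probabilities = [0.8, 0.6, 0.4, 0.25, 0.45, 0.3, 0.15, 0.1]

at_default = classification_counts(observed, probabilities, cutoff=0.5)
at_lower = classification_counts(observed, probabilities, cutoff=0.2)
print(at_default)
print(at_lower)
```

In this toy example, lowering the cutoff from 0.5 to 0.2 doubles the correctly classified events (true positives) but also turns two correctly classified non-events into false positives.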
Details Table: Neural Network
• The Details Table pane provides detailed statistics about the model via the different tabs, which are model dependent.
• Model Information
• Iteration History
• Convergence
• Confusion Matrix
• Lift
• ROC
• Misclassification
• Assessment Statistics
To display the details table, click (Maximize) on the object toolbar to enter Maximize mode.
Model Information – a summary of the features of the model, including number of observations, number of neurons, weight and bias parameters, and the value of the objective function
Iteration History – the value of the Objective and Loss functions at each iteration in the network building process
Convergence – the convergence criterion that was reached and triggered the termination of the model building process
Confusion Matrix – a summary of the correct and incorrect classifications for the model
Lift – the binned assessment results that are used to generate the Lift plot
Misclassification – a summary of the correct and incorrect classifications for the model
ROC – Sensitivity and 1-Specificity values calculated at each cutoff value to generate the ROC plot
Assessment Statistics – the value of any assessment statistics computed for the model
Training and Exploring a Neural Network Model in SAS Visual Data Mining and Machine Learning
This demonstration illustrates the training and exploration of a neural network model in SAS Visual Data Mining and Machine Learning on the VS_BANK_PART data.
1. Open the Model Fitting report that you saved in the previous demonstration.
2. In the upper left corner of the report, click (New page) next to Page 3.
3. For convenience, rename Page 4 (newly created page) to NN.
a. Click (Options) on the Page 4 tab and select Rename page.
b. Enter the new name (NN) inside the box.
4. Select Objects and scroll down to the SAS Visual Data Mining and Machine Learning objects.
5. Highlight Neural Network, and either double-click it or drag it into the main field.
6. Disable the auto-refresh option by clicking Menu at the top right of your screen and selecting Interface options > Disable auto-refresh.
7. Select the Roles pane. Add tgt Binary New Product as the Response variable.
8. Add category 1 Account Activity Level, category 2 Customer Value Level, and the twelve variables that begin with logi_rfm as Predictors.
Scrutiny of the neural network options reveals that the default architecture consists of one hidden layer and 10 hidden units (neurons). Input variable standardization is performed by default on interval (measure) valued inputs, and no direct connections are allowed between the input and target layers.
9. In the Options pane under the Model Display section, select the Network Diagram property and select the box corresponding to the Neuron labels property.
10. Click (Refresh) to fit the neural network model.
11. You can investigate assessment statistics from the summary pane. By default, the KS (Youden) value is displayed. Click KS (Youden) and select Misclassification Rate. The model uses 1,060,038 observations, and the model is built to predict event=1.
The Network diagram illustrates the default architecture and consists of an input layer, one hidden layer with 10 neurons and an output layer. The size of each neuron is a function of the magnitude of the estimated weights (estimated parameter values) contained in it. Note: Because weight values start at random perturbations around zero and are iterated away from zero in the model fitting process, relatively small neuron sizes indicate that these components have a relatively small influence in determining predictions and might indicate an overly complex architecture for the signal in the data.
You can tune the model further by changing the number of hidden layers, the number of neurons in each hidden layer, the activation function for each layer, or other options.
Note: Some SAS Visual Data Mining and Machine Learning models are created with a nondeterministic process. This means that you might experience different results when opening the report again later.
Honest Assessment and Exploration of the Neural Network Model
To perform honest assessment, you select the model that performs best on the validation data set. In the previous section, you created a partition indicator variable that can be used here to monitor the performance of the model on validation data. However, the partition is re-created each time a report is generated within Visual Analytics. Therefore, it is suggested that you add the partition indicator variable to the data table before loading it into Visual Analytics, so that the partition is not re-created at the report level. The VS_BANK_PART data set already contains such a partition indicator variable for this reason. At this point in the course, you have two partition indicator variables. For convenience, you might want to hide the Part_Ind variable created in a previous demonstration and use the Partition Indicator variable (which already exists in the data set) throughout.
1. To hide the variable, click the Data tab and locate the Part_Ind variable.
2. Right-click the Part_Ind variable and select Hide.
Note: To unhide any variable, use (Actions) in the Data pane.
3. To perform honest assessment, click the Data tab and then find the Partition Indicator variable under Category.
4. Right-click Partition Indicator and select New partition.
5. The Partition Indicator variable was created such that the validation data values are flagged as Validation and training data values are flagged as Training. Ensure that the New Partition window has settings as shown below.
6. Click OK to close the New Partition window.
7. When the partition indicator variable is set as the partition column, it is ready to use for honest assessment. Click the Roles tab and assign Partition Indicator under Partition ID.
8. Click Refresh to update the model.
9. Results are updated to reflect the change in the underlying training sample. Note that the validation misclassification rate is 0.1698.
10. Click the Options tab at the right of your screen to explore the neural network model’s properties and options.
11. Click the Hidden Layers property and change Number of hidden layers to 2.
Note: Alternatively, you could select Network Plot and right-click in the middle of the plot to choose the Add a hidden layer option from the list.
12. Under the Hidden layer 1 and Hidden layer 2 properties, change the Neurons value from the default value of 10 to 5.
Note: Alternatively, you could select the network plot and then right-click to choose Edit hidden layer > Hidden layer 1 and change the Neurons value. The Neurons value for Hidden layer 2 can be changed in the same way.
13. Click the Refresh button to update the model. The updated model produces a validation misclassification rate of 0.1655, which indicates a slight improvement over the previous result.
14. Click the Options pane. In the Model Display section, select General and change Plot layout to Stack.
15. Click the Iteration plot tab to see how the objective function changes at each iteration as the network is grown. To switch to a relative importance plot, right-click the Iteration Plot window and select Relative Importance plot from the list of choices.
Note: Alternatively, from the Options pane, under the Model Display section, you can choose the plot to show by expanding the Iteration Plot/Relative Importance Plot section.
The variables in the Relative Importance plot are ranked using their first-split log worth in a decision tree when applied to the scored training data.
Note: When validation data is available, you can also switch to a validation error plot, which displays the change in validation error at each iteration of the training process.
16. Right-click the Relative Importance Plot window, select Partial dependence, and click Refresh.
The PD plot can show whether the relationship between the target and a predictor is linear, monotonic, or more complex. For the model that was fit previously, the plot shows the relationship between the log-transformed number of products purchased in the past three years and the model’s prediction. Here, the model's predicted probability tends to increase as the log-transformed number of products purchased in the past three years increases, although this trend is not linear.
Note: In the PD plot, the predictor that is shown by default is the predictor with the largest importance value.
17. To understand the relationship between the model’s predictions and other inputs, right-click the current PD plot and select a variable from the list of available inputs.
18. Select the category2 Customer Value Level variable and click Refresh.
As you might expect, the predicted probability of purchasing a product is lowest for the customers who are least profitable and creditworthy (that is, customers that belong to level E), whereas it is highest for the customers who are most profitable and creditworthy (that is, level A customers). This can be an important insight for the business.
19. Click the Assessment tab. By default, it displays a confusion matrix, which helps analyze the performance of a classifier.
The output below summarizes the counts of true positives, false positives, true negatives, and false negatives on the training and validation data. The higher the values along the diagonal, the better. The cell in the second row and second column for the validation data gives the number of true positive cases: 36,799 cases are correctly predicted as 1 (event) when they are known to be 1 (event). Similarly, the cell in the first row and second column for the validation data shows the count of false positive cases: 18,775 cases are misclassified as events (1) when they are known non-events (0).
20. Right-click on the confusion matrix plot and select Lift chart. The lift plot summarizes the model’s ability to rank-order cases (blue line) relative to a naïve model (a horizontal line with an intercept of 1) and a best model (orange line). For example, when rank-ordered by the model, the top 15% (percentile) of the data has about 3.09 times as many responders as a random ordering of the data, and 1.91 times fewer responders than a perfect ordering of the training data.
21. Right-click in the center of the cumulative lift chart and select ROC to switch to a ROC curve. The ROC curve summarizes the true positive rate (Sensitivity) and false positive rate (1 – Specificity) across thresholds or cutoffs in the data. The 45-degree line represents the performance of the naïve model, and the vertical dashed line corresponds to an optimal threshold in the data.
22. Place your cursor at the top of the Max Separation line where it intersects the ROC curve. A tooltip reveals an optimal cutoff value of 0.20.
23. Right-click the ROC curve again to switch to the Misclassification plot.
The true positive frequency (validation data) is 36,799 at the default prediction cutoff value of 0.50.
24. Click the Options tab. In the Neural Network section, click the Assessment property to change the Prediction cutoff value from the default of 0.5 to 0.2.
25. Click Refresh. Examine the Misclassification plot again. The number of true positives has more than doubled, to 83,429.
Note: A change in the prediction cutoff also changes the values in the confusion matrix.
26. Click Maximize on the top right of your page for additional details about the fitted model.
The Model Information provides summary details about the model’s architecture. There are 130 weights estimated in the chosen architecture for the 14 variables assigned as predictors.
Misclassification details give confusion matrix information about both the training and validation data sets. It clearly indicates that the model does fairly well on predicting training and validation cases.
27. Save the report. Click (Menu) and select Save.
2.01 Multiple Choice Question
The size of the nodes in the Network diagram is determined by which of the following?
a. number of hidden layers
b. number of input variables
c. number of neurons
d. magnitude of parameter estimates
Practice
In this practice, you continue to work with the PVA_PARTITION data set to train a neural network model. The model aims to classify those customers who made a donation.
1. Training a Neural Network Model in SAS Visual Data Mining and Machine Learning
a. Return to your remote desktop client machine. If your session timed out, sign in using the information provided by your instructor.
b. Open your saved report, Exercise 1, which was created in the practice in Lesson 1.
c. Select the Data pane on the left of the canvas and open the PVA_PARTITION data source. If you have not done so already, in the Measure column, right-click Target Gift Flag and select Category.
d. Create a new page.
e. Add a neural network to the canvas.
f. Disable auto-refresh on the menu bar (if not done already).
g. Add Target Gift Flag as the response.
h. Under Predictors, click Add. In the Add Data Items window, select all predictor variables except for these five:
• Control Number
• Demographic Cluster
• Partition
• Target Gift Amount
• Target Gift Amount with Zero
(In all, you add 24 predictors.)
i. Create the neural network model by clicking Refresh or enabling auto-refresh.
• How many observations are used by the algorithm?
• Why are not all observations used by the algorithm?
• What is the misclassification rate for the model created with default settings?
j. Select the Options pane on the right and change Optimization Method to SGD. Do you see any improvement in the misclassification rate?
k. Perform honest assessment and examine the results.
1) Select the Data pane on the left of the canvas and set the Partition variable as a new partition.
2) Select the Roles pane on the right of the canvas and assign the Partition variable under the Partition ID role. Refresh the model and note the validation misclassification rate.
3) Select the Options pane and change the L2 regularization parameter value to 0.001. Under Hidden Layers, change the Number of Hidden Layers property to 2. Do these changes result in any improvement in the validation misclassification rate?
4) Examine the validation cumulative lift chart. What can you determine about the top 10% (percentile) of the data? How does this model compare to the Best model?
2. (Optional) Performing Autotuning to Determine Optimal Model Parameters
a. Create a new page and add a neural network object to the canvas.
b. Add Target Gift Flag as the response.
c. Under Predictors, click Add. In the Add Data Items window, select all predictor variables except for these four:
• Control Number
• Demographic Cluster
• Target Gift Amount
• Target Gift Amount with Zero
(In all, you add 24 predictors.)
d. In the Roles pane, assign the Partition variable under the Partition ID role.
e. In the Options pane, select the Autotune property and use the default autotune hyperparameters. Update the model (processing might take several minutes).
f. Examine the optimal values selected for
• Number of hidden layers
• Number of neurons
• L1 and L2
g. Did you notice any improvement in the validation misclassification rate compared to the previous model?
h. Use a validation cumulative lift chart to compare this model with a previous model at the top 10% (percentile) of the data.
i. Save the report. Click (Menu) and select Save.
2.3 Support Vector Machines
Objectives
• Discuss the basics of support vector machines (SVMs).
• Discuss options related to SVMs in SAS Visual Data Mining and Machine Learning.
• Train a support vector machine model.
• Perform model exploration.
Tasks for SVMs
[Diagram: SVM branches into two tasks, Classification and Regression.]
In general, support vector machines (SVMs) are used for classification and regression tasks. Until recently, support vector machines were implemented for classification tasks only in Visual Data Mining and Machine Learning. A support vector machine in Visual Data Mining and Machine Learning is a machine learning model that is used to perform classification by constructing a set of hyperplanes that maximizes the margin between two classes.
Classification: Starting Point
• Training data set: patients with known diagnoses
• Input variables: data about patients, $x_i \in \mathbb{R}^d$
• Response variable: two diseases, $y_i \in \{+1, -1\}$
For mathematical convenience, the binary target is defined by values +1 and -1, rather than the usual 1 and 0.
Classification
• Classification function: $f : \mathbb{R}^d \to \{+1, -1\}$
• Diagnosis = f(new patient)
[Figure: a new case whose class is unknown, marked "?"]
How to Classify Red Versus Green?
In this simple illustration, the goal is to classify red versus green. There are many classification rules (lines) that could be used to perfectly separate the red (upper left in the slide) and green (lower right) cases.
Linear Separation of the Training Data
• A separating hyperplane H is given by
   • the normal vector w,
   • an additional parameter, b, called bias.
$H = \{\, x \mid \langle w, x \rangle + b = 0 \,\}$, where $\langle w, x \rangle$ denotes the dot product.
[Figure: hyperplane H with normal vector w]
This is a simple linear problem to start with. Later, you see a more complex nonlinear problem.
Training versus Prediction
• Training: Select w and b in such a way that the hyperplane separates the training data (that is, construction of a hyperplane).
• Prediction of the class for a new patient: On which side of the hyperplane is the new data point located?
Data points located in the direction of the normal vector are diagnosed as positive. Data points on the other side of the hyperplane are diagnosed as negative.
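The prediction rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the software's implementation: the hyperplane values are hypothetical, and assigning points that lie exactly on the hyperplane to the positive class is an assumption.

```python
def predict(w, b, x):
    """Return +1 if x lies on the side of the hyperplane that w points to, else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b  # <w, x> + b
    # Assumption: a point exactly on the hyperplane (score == 0) is labeled +1.
    return 1 if score >= 0 else -1


# Hypothetical hyperplane in 2-D: x1 - x2 = 0, normal vector w = (1, -1), bias b = 0
w, b = [1.0, -1.0], 0.0
print(predict(w, b, [3.0, 1.0]))  # point in the direction of the normal vector
print(predict(w, b, [1.0, 3.0]))  # point on the other side
```

Only the sign of ⟨w, x⟩ + b matters for the diagnosis; its magnitude indicates how far the point is from the hyperplane.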
Which Hyperplane Is the Best One?
If the data points are linearly separable, then infinitely many separating hyperplanes (that is, classification rules) exist. But the question is: Which hyperplane is the best one?
A “Fat” Hyperplane …
The starting point to get to a unique solution is to think of a “fat” hyperplane. This leads to a separator that has the largest margin of error on either side.
A Maximum-Margin Hyperplane
Among all these hyperplanes (H), one of them has the maximum margin. It is essentially the median of the fat hyperplane.
A Maximum-Margin Hyperplane
Margin width: $\frac{2}{\|w\|}$
The width of this maximum-margin hyperplane is determined by the usual calculation of the distance from a point to a line. In the slide above, ||w|| is the norm of the vector w, where ||w|| = sqrt(w’w). The norm of a vector is a measure of length.
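As a small worked example of the width calculation, the sketch below computes the norm and the margin width 2/||w|| for a hypothetical normal vector. It assumes the usual scaling in which the points closest to the hyperplane satisfy |⟨w, x⟩ + b| = 1.

```python
import math


def margin_width(w):
    """Width of the margin, 2 / ||w||, where ||w|| = sqrt(w'w)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm


# Hypothetical normal vector: ||w|| = sqrt(3^2 + 4^2) = 5, so the width is 2/5.
print(margin_width([3.0, 4.0]))
```

A shorter normal vector gives a wider margin, which is why maximizing the margin is equivalent to minimizing ||w||.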
Training Data Not Linearly Separable
• Penalty: C * (Distance to hyperplane)
• C is an error weight (regularization parameter).
[Figure labels: ξ, H]
If the data points are not linearly separable, we have a so-called soft margin hyperplane.
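To make the penalty idea concrete, the sketch below uses the standard textbook soft-margin formulation, in which each point gets a slack value ξ = max(0, 1 − y(⟨w, x⟩ + b)) and the total penalty is C times the sum of the slacks. This formulation is an assumption for illustration; the notes do not state the exact penalty that the software uses, and the hyperplane and data points are hypothetical.

```python
def slack(w, b, x, y):
    """Slack xi = max(0, 1 - y * (<w, x> + b)); zero for points outside the margin."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def total_penalty(w, b, points, labels, C):
    """C times the summed slack over all training points."""
    return C * sum(slack(w, b, x, y) for x, y in zip(points, labels))


# Hypothetical hyperplane x1 = 0 and four training points
w, b = [1.0, 0.0], 0.0
points = [[2.0, 0.0], [0.5, 1.0], [-1.0, 0.0], [0.5, -1.0]]
labels = [1, 1, -1, -1]

for C in (1.0, 10.0):
    print(C, total_penalty(w, b, points, labels, C))
```

The second point lies inside the margin and the fourth is on the wrong side of the hyperplane, so only those two contribute to the penalty. Increasing C makes the same violations more expensive.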
What Are the Support Vectors?
• “Carrying vectors”
• The points located closest to the hyperplane
• Determining the location of the hyperplane
• All other data points have $\alpha_i = 0$.
$w = \sum_{i=1}^{\#sv} \alpha_i y_i x_i^{sv}$, where $\alpha_i$ is the Lagrangian multiplier.
The properties of the maximum-margin hyperplane are described by the support vectors. The construction of the maximum-margin hyperplane is not explicitly dependent on the dimension of the input space. In the illustration above, only the five points that are the carrying vectors are used to determine w. Note: Refer to Appendix A for further mathematical details about support vectors and the Lagrangian approach.
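The sum over support vectors can be illustrated with a toy case. The multipliers, labels, and points below are hypothetical values chosen so that they form a valid two-point solution; in practice the multipliers come out of the training algorithm.

```python
def weight_vector(alphas, labels, support_vectors):
    """Compute w = sum_i alpha_i * y_i * x_i over the support vectors only."""
    dim = len(support_vectors[0])
    w = [0.0] * dim
    for alpha, y, x in zip(alphas, labels, support_vectors):
        for j in range(dim):
            w[j] += alpha * y * x[j]
    return w


# Hypothetical two-point problem: (1, 1) is class +1 and (-1, -1) is class -1.
alphas = [0.25, 0.25]            # Lagrangian multipliers of the support vectors
labels = [1, -1]
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]

w = weight_vector(alphas, labels, support_vectors)
print(w)
```

Points that are not support vectors have a multiplier of zero, so leaving them out of the sum does not change w.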
Problem: Not Linearly Separable Data Points
[Figure: input space, 2-D]
Idea: Feature Space
[Figure: input space, 2-D]
• Feature space is a nonlinear transformation of the input variables into a high-dimensional space.
• The maximum-margin hyperplane is constructed in the high-dimensional feature space.
Solution: The Kernel Trick
[Figure labels: Input space 2-D; Feature space; Linear separation with dot product; Nonlinear separation with kernel function]
Here is an example that is not linearly separable in two dimensions, but it is easy to separate in three dimensions. This can be generalized to higher dimensions.
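The idea of gaining separability by adding a dimension can be sketched with made-up data. The points and the mapping (x1, x2) → (x1, x2, x1² + x2²) below are illustrative assumptions, not the data or the transformation shown on the slide.

```python
def lift_to_3d(point):
    """Map a 2-D point to 3-D by adding its squared distance from the origin."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)


# Hypothetical data: one class clusters near the origin, the other surrounds it,
# so no straight line in 2-D separates them.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]   # class -1
outer = [(2.0, 0.0), (0.0, 2.0), (-1.5, -1.5)]   # class +1


def side_of_plane(p3, threshold=2.0):
    """Classify with the plane z = threshold, that is, w = (0, 0, 1) and b = -threshold."""
    return 1 if p3[2] - threshold >= 0 else -1


print([side_of_plane(lift_to_3d(p)) for p in inner])
print([side_of_plane(lift_to_3d(p)) for p in outer])
```

In the lifted space a flat plane separates the two classes, and that plane corresponds to a circle when it is projected back to the original two dimensions.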
Visual Data Mining and Machine Learning Kernel Functions
• Linear: $K(x_i, x_j) = \langle x_i, x_j \rangle$
• Polynomial: $K(x_i, x_j) = (\gamma \langle x_i, x_j \rangle + k)^d$, where $\gamma$ = gain or scale and $k$ = bias
Note: Available polynomial degrees are quadratic or cubic (d=2 or 3).
These are the kernel functions that are now available in the SVM object of Visual Data Mining and Machine Learning. Why do we call it a trick? You do not have to know exactly what the feature space looks like. It is enough to specify the kernel function as a measure of similarity. You do not carry out the calculations in the feature space explicitly; you only use the result of the kernel function. Still, you have the geometric interpretation in the form of a separating hyperplane (that is, more transparency than for a neural network).
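The sketch below illustrates why the kernel is a shortcut. For 2-D inputs, a quadratic kernel with gain 1 and bias 1 returns the same number as an ordinary dot product taken in an explicit six-dimensional feature space. The function names and the default gain and bias values are assumptions made for this illustration; they are not taken from the software.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, degree=2, gain=1.0, bias=1.0):
    """(gain * <u, v> + bias) ** degree; the default values here are illustrative only."""
    return (gain * dot(u, v) + bias) ** degree


def quadratic_feature_map(x):
    """Explicit feature space of the quadratic kernel for 2-D input, gain = 1, bias = 1."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0]


u, v = [1.0, 2.0], [3.0, -1.0]
via_kernel = polynomial_kernel(u, v)                                     # stays in 2-D
via_features = dot(quadratic_feature_map(u), quadratic_feature_map(v))   # goes to 6-D
print(via_kernel, via_features)
```

Both routes give the same similarity value, but the kernel never constructs the six-dimensional vectors. For a cubic kernel or for more inputs, the explicit feature space grows quickly while the cost of the kernel evaluation stays about the same.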
Summary of SVM
An SVM is a hyperplane with a maximum-margin in a feature space, constructed by use of a kernel function in the input space.
Parameters for SVMs for Classification
• The penalty C (regularization term)
• The kernel function and its parameters
Advantages of SVMs
• Finds a global, unique minimum
• The kernel trick
• A simple geometric interpretation
• Strong ability to generalize
• The complexity of the calculations not dependent on the dimension of the input space; avoids the curse of dimensionality
Disadvantages of SVMs
• Which kernel function?
• How to select the parameters of the kernel function?
2.02 Multiple Choice Question
A kernel is essentially a mapping function that does which of the following?
a. transforms a low dimensional input space to some other higher dimensional space (feature space)
b. transforms a higher dimensional space (feature space) to low dimensional input space
c. separates events and non-events cases
d. maximizes the separating distance
Support Vector Machine: Data Roles
• Response – one category variable
• Predictors – at least one measure or category variable
• Partition ID – only one partition variable
A support vector machine requires that a category variable be assigned as the response. If you assign a category variable with more than two levels, then by default the last level (alphanumerically) is chosen as the event level. If you want to select any other level to model as the event level, then from the Event Level property in the Options pane, select Choose Select Event Level and then choose the desired level of interest. Assign at least one measure or category variable to the Predictors role. If a partition variable is identified and the Partition ID data role is specified, then holdout validation can be performed.
Note: If you do not have a partition variable, then you can create one using SAS Visual Analytics functionality as discussed earlier.
Adding roles to the model automatically updates the model. If you do not want the model to be updated automatically when you add roles, select the Report menu from the top right corner and select Interface options > Disable auto-refresh. After you define all the roles, you can update the model by clicking Interface options > Enable auto-refresh from the Report menu or clicking (Refresh).
Support Vector Machine: Options
General
• Event level
• Autotune
• Kernel function
• Include missing
• Penalty value
• Tolerance value
• Maximum iterations
• Standardize measure predictors
General
• Event level – enables you to choose the event level of interest. When your category response variable contains more than two levels, SAS Visual Data Mining and Machine Learning treats all observations in the level of interest as an event and all other observations as nonevents.
• Autotune – enables you to specify the constraints that control the autotuning algorithm. The constraints determine how long the algorithm can run, how many times the algorithm can run, and how many model evaluations are allowed. When you are unsure what to use for model settings, autotuning can be used to find optimal values for model hyperparameters. The autotuning algorithm selects the values for Kernel function and Penalty value that produce the best model.
• Include missing – specifies whether observations with missing values are included in the model. For category predictors, missing values are assigned their own measurement levels. For measure predictors, missing values are imputed with the measure variable's mean.
• Penalty value – specifies the penalty value. The penalty value balances model complexity and training error. A larger penalty value creates a more robust model at the risk of overfitting the training data.
• Tolerance value – specifies a custom tolerance value for model training. The tolerance value balances the number of support vectors and model accuracy. A tolerance value that is too large creates too few support vectors, and a value that is too small overfits the training data.
• Maximum iterations – specifies a custom number of iterations for model training.
• Standardize measure predictors – specifies whether interval variables are standardized.
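As a rough illustration of what the Include missing and Standardize measure predictors options do to a measure predictor, the sketch below imputes missing values with the mean and then standardizes. The z-score formula with the sample standard deviation is an assumption; the notes do not state which standardization the software applies. The variable name and values are hypothetical.

```python
import statistics


def impute_mean(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def standardize(values):
    """Assumed z-score standardization: subtract the mean, divide by the sample std dev."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


ages = [30.0, None, 50.0, 40.0]     # hypothetical measure predictor with one missing value
filled = impute_mean(ages)
print(filled)
print(standardize(filled))
```

Standardizing matters for SVMs because the dot product and the kernel functions are sensitive to the scale of the inputs: an unscaled predictor with a large range can dominate the similarity measure.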
Support Vector Machine: Options
Assessment
• Number of bins
• Prediction cutoff
• Statistic percentile
• Tolerance
Assessment
• Number of bins – specifies the number of bins to use in the lift calculations in the Assessment plot. By default, set at 20. However, you can enter your own number of bins if desired. Increasing the number of bins increases the accuracy of the assessment at the expense of computing time.
• Prediction cutoff – specifies the cutoff value at which a computed probability is considered an event. By default, it is set at 0.5. Changing the default value affects the misclassification rate.
• Statistic percentile – specifies the depth for the percentile bins to calculate the observed average, lift, cumulative lift, cumulative percentage captured, cumulative percentage events, and gain.
• Tolerance – specifies the tolerance value that is used to determine the convergence of the iterative algorithm that estimates the percentiles. Specify a smaller value to increase the algorithmic precision.
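The effect of the prediction cutoff on the misclassification rate can be sketched with a handful of made-up predicted probabilities. Treating a probability exactly equal to the cutoff as an event is an assumption made for this illustration.

```python
def misclassification_rate(probs, actual, cutoff=0.5):
    """Share of observations whose predicted class differs from the actual class."""
    # Assumption: a probability equal to the cutoff counts as an event.
    predicted = [1 if p >= cutoff else 0 for p in probs]
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors / len(actual)


# Hypothetical predicted event probabilities and actual outcomes (1 = event)
probs = [0.9, 0.7, 0.45, 0.3, 0.1]
actual = [1, 1, 1, 0, 0]

for cutoff in (0.5, 0.4, 0.2):
    print(cutoff, misclassification_rate(probs, actual, cutoff))
```

In this toy data, lowering the cutoff from 0.5 to 0.4 captures the third event and removes the only error, while lowering it further to 0.2 introduces a false positive.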
Support Vector Machine: Model Display Options
General
• Plot layout
• Statistics to show
Partial Dependence Plot
• Show partial dependence
• X axis
Assessment Plots
• Plot to show
• Y axis
The Model Display properties control the output layout and the plots and statistics to show. The following display properties are available for the support vector machine:
General
• Plot layout – specifies how the results windows are displayed on the canvas. Fit aligns all of the objects on the canvas automatically. Stack displays the objects as if they are in a slide deck. Only one object is displayed at a time. When Stack is specified, a control bar lets you move between objects.
• Statistic to show – specifies which assessment statistic to display in the model.
Partial Dependence Plot
• Show partial dependence – specifies whether the Partial Dependence plot is displayed.
• X axis – specifies which variable is displayed in the Partial Dependence plot.
Assessment Plots
• Plot to show – specifies which assessment plot to show.
• Y axis – specifies which statistic is plotted in the lift plot.
Support Vector Machine: Results
[Figure: results window showing the summary bar, Assessment plot, and Relative Importance plot]
By default, the support vector machine results window displays the Relative Importance plot and the confusion matrix together on one canvas. You can also request the Partial Dependence plot by selecting it in the Options pane. Change the Plot Layout option to Stack to place each result on its own canvas with tabbed access.
Analyzing Support Vector Machine Results
Two panes appear on the canvas that help you analyze the results of the support vector machine model.
• Relative importance plot – plots the importance value of each input variable.
• Assessment plot – helps determine how well the model fits the data.
Support Vector Machine Results: Relative Importance Plot
• The Relative Importance Plot window shows the importance value of each input variable.
The Relative Importance plot charts the importance value of each input variable. The variables are ranked using their first-split log worth when applied to the scored training data. The Relative Importance plot can be empty if no inputs are determined to be important.
Support Vector Machine Results: Assessment Plots
Four plots are available to show in this window pane:
• The Assessment window defaults to a confusion matrix chart. A ROC chart, lift chart, and misclassification plot are also available.
The Assessment window enables you to choose to display a confusion matrix, lift chart, ROC chart, or misclassification plot. To switch to another plot, right-click in the center of the Assessment window and select the desired plot.
Details Table: Support Vector Machine
• The Details Table pane provides detailed statistics about the model via the different tabs, which are model dependent.
  • Model Information
  • Iteration History
  • Training
  • Fit Statistics
  • Confusion Matrix
  • Lift
  • ROC
  • Misclassification
  • Assessment Statistics
• To display the details table, click Maximize on the object toolbar to enter Maximize mode.
• Model Information – a brief description of the settings used to create the model
• Iteration History – the complementarity and feasibility statistics for each iteration
• Training – a brief description of the components of the model, including number of support vectors, bias value, and inner product of weights, among others
• Fit Statistics – the values of various fit statistics computed for the model
• Confusion Matrix – a summary of the correct and incorrect classifications for the model
• Lift – the binned assessment results that are used to generate the lift plot
• Misclassification – a summary of the correct and incorrect classifications for the model
• ROC – Sensitivity and 1-Specificity values calculated at each cutoff value to generate the ROC plot
• Assessment Statistics – the value of any assessment statistics computed for the model
To hide the Details table, click the icon in the upper right of the object page.
Training and Exploring an SVM Model in SAS Visual Data Mining and Machine Learning
This demonstration illustrates the training and exploration of an SVM model in SAS Visual Data Mining and Machine Learning.
1. Open the Model Fitting report from the location My Folder > My Tasks, if it is not already open.
2. In the upper right corner of the neural network page, click (More) to access the list of options, including Duplicate and Duplicate as. Press the Alt key. Notice that the two options have changed to Duplicate on new page and Duplicate on new page as. Continue to hold down the Alt key and select Duplicate on new page as > Support Vector Machine.
3. Select the new Support Vector Machine on the canvas of Page 5 to make it the active object. The entire neural network model is now duplicated as a support vector machine. This saves you the effort of assigning the same variables to the response and predictor roles.
4. For convenience, rename Page 5 (the newly created page) to SVM.
5. Disable the auto-refresh option, if not done already.
The Data Roles pane should resemble the following:
6. Click the Options tab and confirm that the default settings resemble those shown below. Note that the default kernel function is Linear.
7. Click the Refresh button to create an SVM model.
8. See the results at the top of the workspace. The summary bar displays the target variable and the modeled level of interest (that is, event=1), the assessment statistics, and the number of observations used. To investigate other statistics, click Validation KS (Youden) and select Validation Misclassification Rate (Event).
Note: Some SAS Visual Data Mining and Machine Learning models are created with a nondeterministic process. This means that you might experience different results when opening the report again later.
Exploration of the SVM Model
To improve model performance and make the model more robust or generalized (to avoid overfitting), you can change some of the model parameters.
1. Click the Options tab. In the Support Vector Machine section, click General and change the Kernel function property to Quadratic.
2. Click Refresh to update the model. The validation performance of the SVM is competitive with the model from the previous demonstration. Model comparison is explored in the next lesson.
The Cumulative Lift chart shows that at a depth of 0.15, the lift would be approximately 3.2. This means that if you targeted the top 15% of your customers based on the predicted probabilities as produced by your model, you would get 3.2 times as many responders as a random ordering of the data and approximately 1.8 times fewer responders than a perfect model.
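The cumulative lift at a given depth can be computed by hand, as in the sketch below. The data are made up (20 observations, 4 events) and are not the demonstration data, so the resulting number differs from the 3.2 quoted above. The simple rounding rule for the number of top-ranked observations is an assumption; the software's binning might differ.

```python
def cumulative_lift(probs, actual, depth):
    """Event rate among the top `depth` share of ranked cases, relative to the overall rate."""
    ranked = sorted(zip(probs, actual), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(depth * len(ranked)))   # assumed rounding rule
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top
    overall_rate = sum(actual) / len(actual)
    return top_rate / overall_rate


# Hypothetical scored data: probabilities 1.00, 0.95, ..., 0.05 and four events
probs = [i / 20 for i in range(20, 0, -1)]
actual = [1, 1, 0, 1, 0, 0, 0, 1] + [0] * 12

print(cumulative_lift(probs, actual, 0.15))   # top 3 of 20 cases
print(cumulative_lift(probs, actual, 1.0))    # whole data set
```

At a depth of 1.0 the lift is always 1, because targeting everyone is the same as a random ordering.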
3. To switch to other assessment plots, right-click in the center of the lift plot and select ROC. The ROC chart summarizes the true positive rate (Sensitivity) and false positive rate (1 – Specificity) across thresholds or cutoffs in the data. The 45-degree line represents the performance of the naïve model, and the vertical dashed line corresponds to an optimal threshold in the data.
Similarly, the misclassification plot and confusion matrix can be examined to check how many observations were correctly and incorrectly classified for each value of the response variable.
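Each point on the ROC chart described in step 3 comes from one cutoff. The sketch below computes the false positive rate and the true positive rate for a few cutoffs on made-up data. It also prints the difference between the two rates (Youden's index), which is one common way to pick a threshold; whether the dashed line in the chart is based on this criterion is an assumption.

```python
def roc_point(probs, actual, cutoff):
    """Return (1 - specificity, sensitivity) for one cutoff."""
    tp = sum(1 for p, a in zip(probs, actual) if p >= cutoff and a == 1)
    fp = sum(1 for p, a in zip(probs, actual) if p >= cutoff and a == 0)
    positives = sum(actual)
    negatives = len(actual) - positives
    return fp / negatives, tp / positives


# Hypothetical predicted event probabilities and actual outcomes (1 = event)
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
actual = [1, 1, 0, 1, 0, 0]

for cutoff in (0.2, 0.5, 0.7):
    fpr, tpr = roc_point(probs, actual, cutoff)
    print(cutoff, fpr, tpr, tpr - fpr)   # last column: Youden's index
```

A naïve model that ignores the inputs has equal true positive and false positive rates at every cutoff, which is why it falls on the 45-degree line.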
4. Click Maximize on the top right of your page for additional details about the fitted model.
Model information lists restrictions under which model fitting occurs.
The Training details table displays the inner product of weights, bias, and the number of support vectors among the other components of the model.
Misclassification details give correct and incorrect classifications of cases on both the training and validation data sets.
Assessment statistics summarize the number of observations used to calculate various assessment statistics for the SVM.
5. Save the report. Click (Menu) and select Save.
Practice
In this practice, you continue to use the PVA_PARTITION data set to train a support vector model to classify those customers who donated.
3. Training a Support Vector Machine Model in SAS Visual Data Mining and Machine Learning
a. Return to your remote desktop client machine. If your session timed out, sign in using the information provided by your instructor.
b. Open your saved report from the previous practice, Exercise 1.
c. Create a new page and add an SVM object to the canvas.
d. Add Target Gift Flag as the response.
e. Under Predictors, click Add. In the Add Data Items window, select all predictor variables except for these four:
• Control Number
• Demographic Cluster
• Target Gift Amount
• Target Gift Amount with Zero
(In all, you add 24 predictors.)
f. Train the support vector machine model by enabling auto-refresh or clicking Refresh and report the misclassification rate (event) for the model created with default settings.
g. Perform honest assessment and examine the results.
1) Select the Roles pane on the right of the canvas and assign the Partition variable under the Partition ID role. Refresh the model and note the validation misclassification rate (event).
2) Select the Options pane and change Kernel function to Quadratic. Do you notice any improvement in the validation Misclassification Rate (Event) statistics? How many support vectors are used by the model?
3) With the above setting, change Penalty value to 2. What change do you notice in the number of support vectors as the penalty value is increased?
4) Save the report. Click (Menu) and select Save.
2.4 Forests
Objectives
• Describe forest models.
• Discuss options related to forest models in SAS Visual Data Mining and Machine Learning.
• Train and explore the forest model.
Forest
• An ensemble model is an aggregation of more than one model where the final prediction of the model is a combination of the predictions from the component models of the ensemble.
• A forest model is an ensemble of classification or regression trees.
• The forest models were developed to overcome the instability that a single classification or regression tree exhibits with minor perturbations of the training data.
Seeing the Forest through the Trees
Trees in the forest differ from each other in two ways:
• Training data for a tree is a sample with replacement from all observations.
• Input variables considered for splitting a node are randomly selected from available inputs. Only the variable most associated with the target is split for that node.
The trees that make up a forest differ from each other in two ways.
• The training data for a tree is a sample with replacement from all observations that were originally training data for the forest.
• The input variables considered for splitting any given node are selected randomly from all available inputs. Among these variables, only the variable most associated with the target is used when forming a split.
This means that each tree is created on a sample of the inputs and from a sample of observations. This process, repeated many times, creates a more stable model than a single tree. A sample of the data is used to construct each tree because the generalization error is often improved when fewer than all available observations are used. A different sample is taken for each tree.
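The two sources of randomness can be sketched with Python's standard library. This is an illustration only: the input names, the number of draws, and the number of candidate inputs are hypothetical, and the step that picks the candidate most associated with the target is left out.

```python
import random


def bootstrap_sample(n_obs, n_draws, rng):
    """Row indices for one tree, drawn with replacement from the training observations."""
    return [rng.randrange(n_obs) for _ in range(n_draws)]


def candidate_inputs(inputs, n_candidates, rng):
    """Random subset of inputs considered for one split (drawn without replacement)."""
    return rng.sample(inputs, n_candidates)


rng = random.Random(42)                              # fixed seed so the sketch is repeatable
inputs = ["age", "income", "region", "tenure"]       # hypothetical input names

rows_for_tree = bootstrap_sample(n_obs=10, n_draws=6, rng=rng)
split_candidates = candidate_inputs(inputs, n_candidates=2, rng=rng)
print(rows_for_tree)
print(split_candidates)
```

Because rows are drawn with replacement, the same observation can appear more than once in one tree's sample, and some observations are not used by that tree at all.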
Leaves = Boolean Rules ^
If X1 {values} and X2 {values}, then Y=value. Leaf
X1
X2
Predicted Y
1