{
"cells": [
{
"cell_type": "markdown",
"id": "02d0018f",
"metadata": {},
"source": [
"# Regression and Classification with Decision Trees"
]
},
{
"cell_type": "code",
"execution_count": 136,
"id": "6a6cf0c1",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import classification_report,confusion_matrix,ConfusionMatrixDisplay\n",
"import seaborn as sns\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"from sklearn import model_selection as ms\n",
"from sklearn.metrics import r2_score \n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn import metrics as m"
]
},
{
"cell_type": "markdown",
"id": "ff149f41",
"metadata": {},
"source": [
"# 1. Classification"
]
},
{
"cell_type": "markdown",
"id": "c4e5a4f0",
"metadata": {},
"source": [
"### Dataset information\n",
"\n",
"### Adult income dataset\n",
"\n",
"https://archive.ics.uci.edu/ml/datasets/Adult\n",
"\n",
"\n",
"1. age: continuous.\n",
"2. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.\n",
"3. fnlwgt: continuous.\n",
"4. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.\n",
"5. educational-num: continuous.\n",
"6. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.\n",
"7. occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.\n",
"8. relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.\n",
"9. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.\n",
"10. gender: Female, Male.\n",
"11. capital-gain: continuous.\n",
"12. capital-loss: continuous.\n",
"13. hours-per-week: continuous.\n",
"14. native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.\n",
"15. income: >50K, <=50K"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5c686ceb",
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv('resources/adult_.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8f164d20",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" age workclass fnlwgt education educational-num \\\n",
"0 25 Private 226802 11th 7 \n",
"1 38 Private 89814 HS-grad 9 \n",
"2 28 Local-gov 336951 Assoc-acdm 12 \n",
"3 44 Private 160323 Some-college 10 \n",
"4 18 ? 103497 Some-college 10 \n",
"... ... ... ... ... ... \n",
"48837 27 Private 257302 Assoc-acdm 12 \n",
"48838 40 Private 154374 HS-grad 9 \n",
"48839 58 Private 151910 HS-grad 9 \n",
"48840 22 Private 201490 HS-grad 9 \n",
"48841 52 Self-emp-inc 287927 HS-grad 9 \n",
"\n",
" marital-status occupation relationship race gender \\\n",
"0 Never-married Machine-op-inspct Own-child Black Male \n",
"1 Married-civ-spouse Farming-fishing Husband White Male \n",
"2 Married-civ-spouse Protective-serv Husband White Male \n",
"3 Married-civ-spouse Machine-op-inspct Husband Black Male \n",
"4 Never-married ? Own-child White Female \n",
"... ... ... ... ... ... \n",
"48837 Married-civ-spouse Tech-support Wife White Female \n",
"48838 Married-civ-spouse Machine-op-inspct Husband White Male \n",
"48839 Widowed Adm-clerical Unmarried White Female \n",
"48840 Never-married Adm-clerical Own-child White Male \n",
"48841 Married-civ-spouse Exec-managerial Wife White Female \n",
"\n",
" capital-gain capital-loss hours-per-week native-country income \n",
"0 0 0 40 United-States <=50K \n",
"1 0 0 50 United-States <=50K \n",
"2 0 0 40 United-States >50K \n",
"3 7688 0 40 United-States >50K \n",
"4 0 0 30 United-States <=50K \n",
"... ... ... ... ... ... \n",
"48837 0 0 38 United-States <=50K \n",
"48838 0 0 40 United-States >50K \n",
"48839 0 0 40 United-States <=50K \n",
"48840 0 0 20 United-States <=50K \n",
"48841 15024 0 40 United-States >50K \n",
"\n",
"[48842 rows x 15 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "markdown",
"id": "58b44e18",
"metadata": {},
"source": [
"### Task\n",
"\n",
"Predict whether a person's income exceeds $50K/year based on census data. The dataset is also known as the \"Census Income\" dataset.\n",
"\n",
"There are two classes:\n",
"\n",
"1. `<=50K`\n",
"2. `>50K`"
]
},
{
"cell_type": "markdown",
"id": "7e7e50de",
"metadata": {},
"source": [
"### 1. Exploratory data analysis"
]
},
{
"cell_type": "markdown",
"id": "e4b0f2b8",
"metadata": {},
"source": [
"1. Print the number of records in the dataset\n",
"2. Print the number of variables in the dataset\n",
"3. Print the column names of the dataset\n",
"4. Print the **head** of the dataset\n",
"5. Print the **tail** of the dataset\n",
"6. Print the basic **info** of the dataset\n",
"7. Print a **describe** summary of the dataset\n",
"8. Plot the class distribution as a bar chart (recommendation: use the seaborn library)."
]
},
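{
"cell_type": "markdown",
"id": "ex01eda0",
"metadata": {},
"source": [
"The inspection steps listed above can be sketched on a small made-up DataFrame. The name `toy` and its values are assumptions for illustration only; on the real dataset the same pandas calls would be applied to `data`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame standing in for the census data\n",
"toy = pd.DataFrame({\n",
"    'age': [25, 38, 28, 44],\n",
"    'income': ['<=50K', '<=50K', '>50K', '>50K'],\n",
"})\n",
"\n",
"n_rows, n_cols = toy.shape                    # (records, variables)\n",
"columns = list(toy.columns)                   # column names\n",
"class_counts = toy['income'].value_counts()   # class distribution\n",
"\n",
"print(n_rows, n_cols)\n",
"print(columns)\n",
"print(toy.head(2))\n",
"print(toy.tail(2))\n",
"toy.info()\n",
"print(toy.describe())\n",
"print(class_counts)\n",
"```\n",
"\n",
"For the bar chart, `class_counts` holds the same numbers that a seaborn count plot of the class column would display."
]
},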
{
"cell_type": "code",
"execution_count": 4,
"id": "dab623a3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of records 48842\n",
"Number of variables 15\n"
]
}
],
"source": [
"print(\"Number of records\", data.shape[0])\n",
"print(\"Number of variables\", data.shape[1])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "9649f77f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" age workclass fnlwgt education educational-num \\\n",
"48837 27 Private 257302 Assoc-acdm 12 \n",
"48838 40 Private 154374 HS-grad 9 \n",
"48839 58 Private 151910 HS-grad 9 \n",
"48840 22 Private 201490 HS-grad 9 \n",
"48841 52 Self-emp-inc 287927 HS-grad 9 \n",
"\n",
" marital-status occupation relationship race gender \\\n",
"48837 Married-civ-spouse Tech-support Wife White Female \n",
"48838 Married-civ-spouse Machine-op-inspct Husband White Male \n",
"48839 Widowed Adm-clerical Unmarried White Female \n",
"48840 Never-married Adm-clerical Own-child White Male \n",
"48841 Married-civ-spouse Exec-managerial Wife White Female \n",
"\n",
" capital-gain capital-loss hours-per-week native-country income \n",
"48837 0 0 38 United-States <=50K \n",
"48838 0 0 40 United-States >50K \n",
"48839 0 0 40 United-States <=50K \n",
"48840 0 0 20 United-States <=50K \n",
"48841 15024 0 40 United-States >50K "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f8f8bb9f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 48842 entries, 0 to 48841\n",
"Data columns (total 15 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 age 48842 non-null int64 \n",
" 1 workclass 48842 non-null object\n",
" 2 fnlwgt 48842 non-null int64 \n",
" 3 education 48842 non-null object\n",
" 4 educational-num 48842 non-null int64 \n",
" 5 marital-status 48842 non-null object\n",
" 6 occupation 48842 non-null object\n",
" 7 relationship 48842 non-null object\n",
" 8 race 48842 non-null object\n",
" 9 gender 48842 non-null object\n",
" 10 capital-gain 48842 non-null int64 \n",
" 11 capital-loss 48842 non-null int64 \n",
" 12 hours-per-week 48842 non-null int64 \n",
" 13 native-country 48842 non-null object\n",
" 14 income 48842 non-null object\n",
"dtypes: int64(6), object(9)\n",
"memory usage: 5.6+ MB\n"
]
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": 8,
"id": "9a48bedb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" count mean std min 25% \\\n",
"age 48842.0 38.643585 13.710510 17.0 28.0 \n",
"fnlwgt 48842.0 189664.134597 105604.025423 12285.0 117550.5 \n",
"educational-num 48842.0 10.078089 2.570973 1.0 9.0 \n",
"capital-gain 48842.0 1079.067626 7452.019058 0.0 0.0 \n",
"capital-loss 48842.0 87.502314 403.004552 0.0 0.0 \n",
"hours-per-week 48842.0 40.422382 12.391444 1.0 40.0 \n",
"\n",
" 50% 75% max \n",
"age 37.0 48.0 90.0 \n",
"fnlwgt 178144.5 237642.0 1490400.0 \n",
"educational-num 10.0 12.0 16.0 \n",
"capital-gain 0.0 0.0 99999.0 \n",
"capital-loss 0.0 0.0 4356.0 \n",
"hours-per-week 40.0 45.0 99.0 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": 9,
"id": "7b0f44a1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" workclass education marital-status occupation relationship \\\n",
"count 48842 48842 48842 48842 48842 \n",
"unique 9 16 7 15 6 \n",
"top Private HS-grad Married-civ-spouse Prof-specialty Husband \n",
"freq 33906 15784 22379 6172 19716 \n",
"\n",
" race gender native-country income \n",
"count 48842 48842 48842 48842 \n",
"unique 5 2 42 2 \n",
"top White Male United-States <=50K \n",
"freq 41762 32650 43832 37155 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": 10,
"id": "94c1f279",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['<=50K', '>50K'], dtype=object)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": 11,
"id": "69bb63c2",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\ProgramData\\Anaconda3\\lib\\site-packages\\seaborn\\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/plain": [
""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAZIAAAEGCAYAAABPdROvAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAX+ElEQVR4nO3df7BfdX3n8eeLhCJbBfkRbUygYSTOFqiGJUaq3RkrbUnZ7gY7oNetEm12YhnY0d1OR/APtd3JrMxqU3EKM3ERAquFFKtEB7qyoKuuSLxYJIQfw91CIZJCEISwXdgmvPeP7+eO31y+ublw8r2Xm/t8zJy55/s+53Pu52Qu8+JzPud7TqoKSZJerkNmugOSpNnNIJEkdWKQSJI6MUgkSZ0YJJKkTubPdAem27HHHltLliyZ6W5I0qxyxx13PFFVCwZtm3NBsmTJEkZHR2e6G5I0qyT5+31t89KWJKkTg0SS1IlBIknqxCCRJHVikEiSOjFIJEmdGCSSpE4MEklSJwaJJKmTOffN9gPhtD++eqa7oFegO/7LeTPdBWlGOCKRJHVikEiSOjFIJEmdGCSSpE4MEklSJwaJJKkTg0SS1IlBIknqZGhBkuRVSbYk+XGSbUn+pNU/leQnSe5sy1l9bS5OMpbk/iRn9tVPS7K1bbs0SVr9sCTXtfrtSZYM63wkSYMNc0TyPPCuqnoLsAxYmeT0tm19VS1ry40ASU4CRoCTgZXAZUnmtf0vB9YCS9uystXXAE9V1YnAeuCSIZ6PJGmAoQVJ9TzbPh7alpqkySrg2qp6vqoeBMaAFUkWAkdU1W1VVcDVwNl9bTa29euBM8ZHK5Kk6THUOZIk85LcCTwO3FxVt7dNFya5K8kXkxzVaouAR/qab2+1RW19Yn2vNlW1G3gaOGZAP9YmGU0yunPnzgNzcpIkYMhBUlV7qmoZsJje6OIUepep3kjvctcO4LNt90EjiZqkPlmbif3YUFXLq2r5ggULXtI5SJImNy13bVXVz4BvAyur6rEWMC8AXwBWtN22A8f1NVsMPNrqiwfU92qTZD5wJPDkcM5CkjTIMO/aWpDktW39cOA3gfvanMe4dwN3t/XNwEi7E+sEepPqW6pqB7Aryelt/uM84Ia+Nqvb+jnArW0eRZI0TYb5PpKFwMZ259UhwKaq+kaSa5Iso3cJ6iHgwwBVtS3JJuAeYDdwQVXtacc6H7gKOBy4qS0AVwDXJBmjNxIZGeL5SJIGGFqQVNVdwKkD6h+YpM06YN2A+ihwyoD6c8C53XoqSerCb7ZLkjoxSCRJnRgkkqRODBJJUicGiSSpE4NEktSJQSJJ6sQgkSR1YpBIkjoxSCRJnRgkkqRODBJJUicGiSSpE4NEktSJQSJJ6sQgkSR1YpBIkjoxSCRJnQwtSJK8KsmWJD9Osi3Jn7T60UluTvJA+3lUX5uLk4wluT/JmX3105JsbdsuTZJWPyzJda1+e5IlwzofSdJgwxyRPA+8q6reAiwDViY5HbgIuKWqlgK3tM8kOQkYAU4GVgKXJZnXjnU5sBZY2paVrb4GeKqqTgTWA5cM8XwkSQMMLUiq59n28dC2FLAK2NjqG4Gz2/oq4Nqqer6qHgTGgBVJFgJHVNVtVVXA1RPajB/reuCM8dGKJGl6DHWOJMm8JHcCjwM3V9XtwOuragdA+/m6tvsi4JG+5ttbbVFbn1jfq01V7QaeBo4ZyslIkgYaapBU1Z6qWgYspje6OGWS3QeNJGqS+mRt9j5wsjbJaJLRnTt37qfXkqSXYlru2qqqnwHfpje38Vi7XEX7+XjbbTtwXF+zxcCjrb54QH2vNknmA0cCTw74/RuqanlVLV+wYMGBOSlJEjDcu7YWJHltWz8c+E3gPmAzsLrtthq4oa1vBkbanVgn0JtU39Iuf+1Kcnqb/zhvQpvxY50D3NrmUSRJ02T+EI+9ENjY7rw6BNhUVd9IchuwKcka4GHgXICq2pZkE3
APsBu4oKr2tGOdD1wFHA7c1BaAK4BrkozRG4mMDPF8JEkDDC1Iquou4NQB9Z8CZ+yjzTpg3YD6KPCi+ZWqeo4WRJKkmeE32yVJnRgkkqRODBJJUicGiSSpE4NEktSJQSJJ6sQgkSR1YpBIkjoxSCRJnRgkkqRODBJJUicGiSSpE4NEktSJQSJJ6sQgkSR1YpBIkjoxSCRJnRgkkqRODBJJUidDC5IkxyX5VpJ7k2xL8pFW/1SSnyS5sy1n9bW5OMlYkvuTnNlXPy3J1rbt0iRp9cOSXNfqtydZMqzzkSQNNswRyW7gj6rqV4DTgQuSnNS2ra+qZW25EaBtGwFOBlYClyWZ1/a/HFgLLG3LylZfAzxVVScC64FLhng+kqQBhhYkVbWjqn7U1ncB9wKLJmmyCri2qp6vqgeBMWBFkoXAEVV1W1UVcDVwdl+bjW39euCM8dGKJGl6TMscSbvkdCpweytdmOSuJF9MclSrLQIe6Wu2vdUWtfWJ9b3aVNVu4GngmAG/f22S0SSjO3fuPDAnJUkCpiFIkrwa+Arw0ap6ht5lqjcCy4AdwGfHdx3QvCapT9Zm70LVhqpaXlXLFyxY8NJOQJI0qaEGSZJD6YXIl6rqrwGq6rGq2lNVLwBfAFa03bcDx/U1Xww82uqLB9T3apNkPnAk8ORwzkaSNMgw79oKcAVwb1X9WV99Yd9u7wbubuubgZF2J9YJ9CbVt1TVDmBXktPbMc8Dbuhrs7qtnwPc2uZRJEnTZP4Qj/0O4APA1iR3ttrHgfclWUbvEtRDwIcBqmpbkk3APfTu+Lqgqva0ducDVwGHAze1BXpBdU2SMXojkZEhno8kaYChBUlVfY/Bcxg3TtJmHbBuQH0UOGVA/Tng3A7dlCR15DfbJUmdGCSSpE4MEklSJwaJJKkTg0SS1IlBIknqxCCRJHVikEiSOjFIJEmdGCSSpE4MEklSJwaJJKkTg0SS1IlBIknqZEpBkuSWqdQkSXPPpO8jSfIq4J8BxyY5ip+/X+QI4A1D7pskaRbY34utPgx8lF5o3MHPg+QZ4C+G1y1J0mwxaZBU1eeAzyX591X1+WnqkyRpFpnSq3ar6vNJ3g4s6W9TVVcPqV+SpFliqpPt1wCfAX4deGtblu+nzXFJvpXk3iTbknyk1Y9OcnOSB9rPo/raXJxkLMn9Sc7sq5+WZGvbdmmStPphSa5r9duTLHmp/wCSpG6mNCKhFxonVVW9hGPvBv6oqn6U5DXAHUluBj4I3FJVn05yEXAR8LEkJwEjwMn05mT+R5I3VdUe4HJgLfAD4EZgJXATsAZ4qqpOTDICXAK89yX0UZLU0VS/R3I38Esv5cBVtaOqftTWdwH3AouAVcDGtttG4Oy2vgq4tqqer6oHgTFgRZKFwBFVdVsLsqsntBk/1vXAGeOjFUnS9JjqiORY4J4kW4Dnx4tV9W+m0rhdcjoVuB14fVXtaO13JHld220RvRHHuO2t9k9tfWJ9vM0j7Vi7kzwNHAM8MeH3r6U3ouH444+fSpclSVM01SD51Mv9BUleDXwF+GhVPTPJgGHQhpqkPlmbvQtVG4ANAMuXL38pl+ckSfsx1bu2/ufLOXiSQ+mFyJeq6q9b+bEkC9toZCHweKtvB47ra74YeLTVFw+o97fZnmQ+cCTw5MvpqyTp5ZnqXVu7kjzTlueS7EnyzH7aBLgCuLeq/qxv02ZgdVtfDdzQVx9pd2KdACwFtrTLYLuSnN6Oed6ENuPHOge49SXeECBJ6miqI5LX9H9OcjawYj/N3gF8ANia5M5W+zjwaWBTkjXAw8C57XdsS7IJuIfeHV8XtDu2AM4HrgIOp3e31k2tfgVwTZIxeiORkamcjyTpwJnqHMlequpr7dbdyfb5HoPnMADO2EebdcC6AfVR4JQB9edoQSRJmhlTCpIkv9f38RB63yvxEpIkacojkn/dt74beIjedzgkSXPcVOdIPjTsjkiSZqep3rW1OMlXkzye5LEkX0
myeP8tJUkHu6k+IuVKerfavoHet8m/3mqSpDluqkGyoKqurKrdbbkKWDDEfkmSZompBskTSd6fZF5b3g/8dJgdkyTNDlMNkj8A3gP8A7CD3rfInYCXJE359t//BKyuqqeg93Iqei+6+oNhdUySNDtMdUTy5vEQAaiqJ+k9Fl6SNMdNNUgOmfBK3KN5mY9XkSQdXKYaBp8Fvp/kenqPRnkPA56JJUmae6b6zfark4wC76L3IMbfq6p7htozSdKsMOXLUy04DA9J0l6mOkciSdJABokkqRODRJLUiUEiSerEIJEkdTK0IEnyxfb+krv7ap9K8pMkd7blrL5tFycZS3J/kjP76qcl2dq2XZokrX5Ykuta/fYkS4Z1LpKkfRvmiOQqYOWA+vqqWtaWGwGSnASMACe3Npclmdf2vxxYCyxty/gx1wBPVdWJwHrgkmGdiCRp34YWJFX1HeDJKe6+Cri2qp6vqgeBMWBFkoXAEVV1W1UVcDVwdl+bjW39euCM8dGKJGn6zMQcyYVJ7mqXvsaf37UIeKRvn+2ttqitT6zv1aaqdgNPA8cM+oVJ1iYZTTK6c+fOA3cmkqRpD5LLgTcCy+i91+SzrT5oJFGT1Cdr8+Ji1YaqWl5Vyxcs8MWOknQgTWuQVNVjVbWnql4AvgCsaJu2A8f17boYeLTVFw+o79UmyXzgSKZ+KU2SdIBMa5C0OY9x7wbG7+jaDIy0O7FOoDepvqWqdgC7kpze5j/OA27oa7O6rZ8D3NrmUSRJ02ho7xRJ8pfAO4Fjk2wHPgm8M8kyepegHgI+DFBV25JsovdQyN3ABVW1px3qfHp3gB0O3NQWgCuAa5KM0RuJjAzrXCRJ+za0IKmq9w0oXzHJ/usY8I6TqhoFThlQfw44t0sfJUnd+c12SVInBokkqRODRJLUiUEiSerEIJEkdWKQSJI6MUgkSZ0YJJKkTgwSSVInBokkqRODRJLUiUEiSerEIJEkdTK0p/9Kmn4P/+mvznQX9Ap0/Ce2DvX4jkgkSZ0YJJKkTgwSSVInBokkqRODRJLUydCCJMkXkzye5O6+2tFJbk7yQPt5VN+2i5OMJbk/yZl99dOSbG3bLk2SVj8syXWtfnuSJcM6F0nSvg1zRHIVsHJC7SLglqpaCtzSPpPkJGAEOLm1uSzJvNbmcmAtsLQt48dcAzxVVScC64FLhnYmkqR9GlqQVNV3gCcnlFcBG9v6RuDsvvq1VfV8VT0IjAErkiwEjqiq26qqgKsntBk/1vXAGeOjFUnS9JnuOZLXV9UOgPbzda2+CHikb7/trbaorU+s79WmqnYDTwPHDPqlSdYmGU0yunPnzgN0KpIkeOVMtg8aSdQk9cnavLhYtaGqllfV8gULFrzMLkqSBpnuIHmsXa6i/Xy81bcDx/Xttxh4tNUXD6jv1SbJfOBIXnwpTZI0ZNMdJJuB1W19NXBDX32k3Yl1Ar1J9S3t8teuJKe3+Y/zJrQZP9Y5wK1tHkWSNI2G9tDGJH8JvBM4Nsl24JPAp4FNSdYADwPnAlTVtiSbgHuA3cAFVbWnHep8eneAHQ7c1BaAK4BrkozRG4mMDOtcJEn7NrQgqar37WPTGfvYfx2wbkB9FDhlQP05WhBJkmbOK2WyXZI0SxkkkqRODBJJUicGiSSpE4NEktSJQSJJ6sQgkSR1YpBIkjoxSCRJnRgkkqRODBJJUicGiSSpE4NEktSJQSJJ6sQgkSR1YpBIkjoxSCRJnRgkkqRODBJJUiczEiRJHkqyNcmdSUZb7egkNyd5oP08qm//i5OMJbk/yZl99dPaccaSXJokM3E+kjSXzeSI5DeqallVLW+fLwJuqaqlwC3tM0lOAkaAk4GVwGVJ5rU2lwNrgaVtWTmN/Zck8cq6tLUK2NjWNwJn99Wvrarnq+pBYAxYkWQhcERV3VZVBVzd10aSNE1mKkgK+GaSO5KsbbXXV9UOgPbzda2+CHikr+32VlvU1ifWXyTJ2iSjSUZ37t
x5AE9DkjR/hn7vO6rq0SSvA25Oct8k+w6a96hJ6i8uVm0ANgAsX7584D6SpJdnRkYkVfVo+/k48FVgBfBYu1xF+/l42307cFxf88XAo62+eEBdkjSNpj1IkvxikteMrwO/DdwNbAZWt91WAze09c3ASJLDkpxAb1J9S7v8tSvJ6e1urfP62kiSpslMXNp6PfDVdqfufODLVfU3SX4IbEqyBngYOBegqrYl2QTcA+wGLqiqPe1Y5wNXAYcDN7VFkjSNpj1IqurvgLcMqP8UOGMfbdYB6wbUR4FTDnQfJUlT90q6/VeSNAsZJJKkTgwSSVInBokkqRODRJLUiUEiSerEIJEkdWKQSJI6MUgkSZ0YJJKkTgwSSVInBokkqRODRJLUiUEiSerEIJEkdWKQSJI6MUgkSZ0YJJKkTgwSSVInsz5IkqxMcn+SsSQXzXR/JGmumdVBkmQe8BfA7wAnAe9LctLM9kqS5pZZHSTACmCsqv6uqv4fcC2waob7JElzyvyZ7kBHi4BH+j5vB942cacka4G17eOzSe6fhr7NFccCT8x0J14J8pnVM90F7c2/zXGfzIE4yi/va8NsD5JB/zr1okLVBmDD8Lsz9yQZrarlM90PaSL/NqfPbL+0tR04ru/zYuDRGeqLJM1Jsz1IfggsTXJCkl8ARoDNM9wnSZpTZvWlraraneRC4L8D84AvVtW2Ge7WXOMlQ71S+bc5TVL1oikFSZKmbLZf2pIkzTCDRJLUiUGily3JO5M8neTOtnyib9vAR9ckuSrJOW396CR/m+RDM9F/HTza39WDfX+Ly1o9SS5tf4d3JfkXfW2e7Vs/K8kDSY6fge7PerN6sl0HXrv77dCq+j9TbPLdqvrdCccYf3TNb9G7RfuHSTZX1T19+xxJ7yaJDVV15YHpvQ5WSY6qqqf2s9sfV9X1E2q/Ayxty9uAy5nwpeUkZwCfB367qh4+QF2eUxyRCIAkv5Lks8D9wJs6Hm5/j655NXAT8OWqurzj79LcMJrky0neleSlfE17FXB19fwAeG2SheMbk/xL4AvAv6qq/32A+zxnGCRzWJJfTPKhJN8D/itwL/Dmqvrbtn1936WC/qX/Kcu/luTHSW5KcnKrDXp0zaK+z38GfK+q1g/v7HSQeRPwZeBC4J4kH0/yhgn7rGuXr9YnOazVJvtbPAy4ATi7qu4bYt8Pel7amtt2AHcB/27Qf0hV9R/20/5HwC9X1bNJzgK+Ru8Swv4eXXMrsCrJZ6rq8ZfVc80pVbUH+AbwjSQLgP8MPJzk7VW1BbgY+AfgF+h9f+RjwJ8y+d/iPwHfB9YAHxnuGRzcHJHMbecAPwG+muQTSfZ6KNv+RiRV9UxVPdvWbwQOTXIs+390zbX0rlXfmOQ1wzs9HUySHNkewLqZ3ghlDb3/EaKqdrTLV88DV9K7vAqT/y2+ALwHeGuSj0/DKRy0HJHMYVX1TeCbSY4B3g/ckOQJeiOUh/Y3IknyS8BjVVVJVtD7H5OfAj+jPbqGXlCNAP92wu/+83at+qtJzmpzKdJASf4b8GvAXwHnVdUDE7YvrKodbf7kbODutmkzcGGSa+lNsj9dVTvG21XVPyb5XeC7SR6rqium4XQOOgaJqKqfAp8DPtcCYc8Um54DnJ9kN/B/gZHqPSphSo+uqaqPJbkSuCbJ+6rqhQNxPjoobQI+WFW797H9S+2SV4A7gT9s9RuBs4Ax4B+BF91qXlVPJlkJfCfJE1V1w4Hu/MHOR6RIkjpxjkSS1IlBIknqxCCRJHVikEiSOjFIJEmdGCTSAZDk+zPdB2mmePuvJKkTRyTSATD+bov2jpZvJ7k+yX1JvjT+tNokb03y/faQyy1JXpPkVUmuTLK1vZvlN9q+H0zytSRfb+/ZuDDJf2z7/CDJ0W2/Nyb5myR3JPlukn8+c/8Kmqv8Zrt04J0KnEzvmU7/C3hHki3AdcB7q+qHSY6g9zSAjwBU1a+2EPhmkvHH+J/SjvUqet/M/l
hVnZpkPXAe8Of0HlD4h1X1QJK3AZcB75qm85QAg0Qahi1VtR0gyZ3AEuBpYEdV/RB6D7xs23+d3kuVqKr7kvw9P38fzLeqahewK8nTwNdbfSvw5iSvBt4O/FXfKzrGH58uTRuDRDrwnu9b30Pvv7Ow96P0x032kqb+47zQ9/mFdsxDgJ9V1bKX3VPpAHCORJoe9wFvSPJWgDY/Mh/4DvD7rfYm4Hh6b6ncrzaqeTDJua19krxlGJ2XJmOQSNOgPSb/vcDnk/wYuJne3MdlwLwkW+nNoXywvVNjqn4fWNOOuY29X2ksTQtv/5UkdeKIRJLUiUEiSerEIJEkdWKQSJI6MUgkSZ0YJJKkTgwSSVIn/x890TuUh/5/qQAAAABJRU5ErkJggg==\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "be30ea29",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "a6a7f016",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "dc213050",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e753c8e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "4bfb74da",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "4ac9c732",
"metadata": {},
"source": [
"### 2. Missing-value handling, dataset preparation and variable encoding\n",
"\n",
"1. Replace `<=50K` with 0 and `>50K` with 1"
]
},
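{
"cell_type": "markdown",
"id": "ex02map0",
"metadata": {},
"source": [
"One possible way to do the replacement is `Series.map` with a dictionary. This is a minimal sketch on made-up labels; the variable names `labels` and `encoded` are assumptions, not part of the exercise.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical labels, not the real dataset\n",
"labels = pd.Series(['<=50K', '>50K', '<=50K'])\n",
"\n",
"# Map each class string to its integer code\n",
"encoded = labels.map({'<=50K': 0, '>50K': 1})\n",
"print(encoded.tolist())\n",
"```\n",
"\n",
"Any value missing from the dictionary becomes NaN, which makes unexpected labels easy to spot."
]
},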
{
"cell_type": "code",
"execution_count": 12,
"id": "ab8933f2",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4ba3013",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "dcdcc2e7",
"metadata": {},
"source": [
"2. Drop the `income` and `fnlwgt` columns, keeping only the features "
]
},
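{
"cell_type": "markdown",
"id": "ex03drop",
"metadata": {},
"source": [
"Dropping columns could look like the sketch below. The frame `toy` and the name `features` are assumptions for illustration; `DataFrame.drop` returns a new frame, so the original still holds the target column for later use.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame with two columns to discard\n",
"toy = pd.DataFrame({\n",
"    'age': [25, 38],\n",
"    'fnlwgt': [226802, 89814],\n",
"    'workclass': ['Private', 'Local-gov'],\n",
"    'income': [0, 1],\n",
"})\n",
"\n",
"# drop returns a copy without the listed columns\n",
"features = toy.drop(columns=['income', 'fnlwgt'])\n",
"print(list(features.columns))\n",
"```"
]
},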
{
"cell_type": "code",
"execution_count": 14,
"id": "2deffe45",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1ca69ac",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "37613f52",
"metadata": {},
"source": [
"3. Get the names of the numeric columns so they can be normalized later, for example: age, capital-loss, hours-per-week"
]
},
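{
"cell_type": "markdown",
"id": "ex04num0",
"metadata": {},
"source": [
"A hedged sketch of one way to list the numeric columns, using `DataFrame.select_dtypes`. The frame `toy` and the name `numeric_cols` are made up for this example.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame mixing numeric and text columns\n",
"toy = pd.DataFrame({\n",
"    'age': [25, 38],\n",
"    'hours-per-week': [40, 50],\n",
"    'workclass': ['Private', 'Local-gov'],\n",
"})\n",
"\n",
"# Keep only the columns whose dtype is numeric\n",
"numeric_cols = toy.select_dtypes(include='number').columns.tolist()\n",
"print(numeric_cols)\n",
"```"
]
},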
{
"cell_type": "code",
"execution_count": null,
"id": "64c88d76",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "f8db313b",
"metadata": {},
"source": [
"4. Use the pandas get_dummies() function to encode the categorical variables, such as workclass, education, etc."
]
},
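{
"cell_type": "markdown",
"id": "ex05dumm",
"metadata": {},
"source": [
"As an illustration of one-hot encoding with `pd.get_dummies`, here is a small sketch on invented data (the names `toy` and `encoded` are assumptions). Each category of the listed column becomes its own indicator column, and numeric columns pass through unchanged.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame with one categorical column\n",
"toy = pd.DataFrame({\n",
"    'age': [25, 38, 28],\n",
"    'workclass': ['Private', 'Local-gov', 'Private'],\n",
"})\n",
"\n",
"# One indicator column per category, named <column>_<category>\n",
"encoded = pd.get_dummies(toy, columns=['workclass'])\n",
"print(list(encoded.columns))\n",
"```"
]
},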
{
"cell_type": "code",
"execution_count": null,
"id": "618a4b91",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "724e4271",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "254c963d",
"metadata": {},
"source": [
"5. Normalize the data using StandardScaler https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html"
]
},
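{
"cell_type": "markdown",
"id": "ex06scal",
"metadata": {},
"source": [
"The following sketch shows how `StandardScaler` might be applied to the numeric columns only. The data and the names `toy`, `numeric_cols` and `scaled` are assumptions for illustration. After scaling, each selected column has mean 0 and unit (population) standard deviation.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Hypothetical numeric data\n",
"toy = pd.DataFrame({\n",
"    'age': [20.0, 30.0, 40.0],\n",
"    'hours-per-week': [30.0, 40.0, 50.0],\n",
"})\n",
"numeric_cols = ['age', 'hours-per-week']\n",
"\n",
"scaler = StandardScaler()\n",
"scaled = toy.copy()\n",
"# fit_transform learns mean and std per column, then standardizes\n",
"scaled[numeric_cols] = scaler.fit_transform(toy[numeric_cols])\n",
"print(scaled)\n",
"```"
]
},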
{
"cell_type": "code",
"execution_count": null,
"id": "c9ce945e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 20,
"id": "b251cd83",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "039aa289",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "545cd4f3",
"metadata": {},
"source": [
"6. Create the vector Y with the classes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "771e83dc",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "aa72dae5",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "153b8a37",
"metadata": {},
"source": [
"### 3. Define the training set and the validation set."
]
},
{
"cell_type": "markdown",
"id": "56ad8a80",
"metadata": {},
"source": [
"1. Create a matrix X containing the features and split the data 80% train / 20% test\n",
"2. Print the shape (dimensions) of the training set (X_train)\n",
"3. Print the shape (dimensions) of the test set (X_test)\n",
"\n",
"Hint: use the train_test_split function from sklearn https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\n"
]
},
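{
"cell_type": "markdown",
"id": "ex07splt",
"metadata": {},
"source": [
"A minimal sketch of an 80/20 split with `train_test_split`, on invented arrays. The `random_state` value and the use of `stratify` are choices made for this example, not requirements of the exercise.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# 10 hypothetical samples with 2 features and balanced binary labels\n",
"X = np.arange(20).reshape(10, 2)\n",
"y = np.array([0, 1] * 5)\n",
"\n",
"# test_size=0.2 keeps 80% for training; stratify preserves class ratios\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    X, y, test_size=0.2, random_state=42, stratify=y\n",
")\n",
"print(X_train.shape, X_test.shape)\n",
"```"
]
},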
{
"cell_type": "code",
"execution_count": 24,
"id": "1df16920",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "861c63b1",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "d7f8657d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "956287ee",
"metadata": {},
"source": [
"### 4. Model training"
]
},
{
"cell_type": "markdown",
"id": "7b18ba2e",
"metadata": {},
"source": [
"1. Create a DecisionTreeClassifier model using the sklearn library https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html\n",
"2. Train the model\n",
"\n",
"Hints:\n",
"\n",
"- Use the fit function\n",
"- Use only the training set (X_train, y_train)"
]
},
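{
"cell_type": "markdown",
"id": "ex08fit0",
"metadata": {},
"source": [
"Training could be sketched as follows with `DecisionTreeClassifier`, the estimator imported at the top of this notebook. The synthetic data from `make_classification`, the name `model` and the `max_depth=3` setting are assumptions for illustration only.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"# Synthetic binary classification data (not the census dataset)\n",
"X, y = make_classification(n_samples=200, n_features=5, random_state=0)\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    X, y, test_size=0.2, random_state=0\n",
")\n",
"\n",
"# Fit on the training portion only\n",
"model = DecisionTreeClassifier(max_depth=3, random_state=0)\n",
"model.fit(X_train, y_train)\n",
"print(model.get_depth())\n",
"```\n",
"\n",
"Limiting `max_depth` is one common way to reduce overfitting in a single tree; leaving it unset lets the tree grow until its leaves are pure."
]
},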
{
"cell_type": "code",
"execution_count": null,
"id": "cd1c9d84",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "da33a61b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "0ab2e1ed",
"metadata": {},
"source": [
"### 5. Compute the evaluation metrics"
]
},
{
"cell_type": "markdown",
"id": "a0af39eb",
"metadata": {},
"source": [
"**Note:** Run the following function, which builds the confusion matrix and computes some metrics. "
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "cd5297c5",
"metadata": {},
"outputs": [],
"source": [
"def metrics(y_true, y_pred):\n",
"    \"\"\"\n",
"    Compute some metrics such as accuracy, F1-score and precision, and build the confusion matrix figure.\n",
"\n",
"    Args:\n",
"        y_true (numpy_array): true classes\n",
"        y_pred (numpy_array): predicted classes\n",
"\n",
"    Returns:\n",
"        cm_fig (ConfusionMatrixDisplay): confusion matrix figure\n",
"        accuracy (float): accuracy\n",
"        report (dict): per-class and averaged metrics\n",
"    \"\"\"\n",
"    # Row-normalized confusion matrix: each row (true class) sums to 1\n",
"    cm = confusion_matrix(y_true, y_pred, normalize='true')\n",
"    report = classification_report(y_true, y_pred, output_dict=True)\n",
"    cm_fig = ConfusionMatrixDisplay(confusion_matrix=cm)\n",
"    return cm_fig, report[\"accuracy\"], report"
]
},
{
"cell_type": "markdown",
"id": "1b4bf45e",
"metadata": {},
"source": [
"1. Use the predict() function to create the vector of predictions\n",
"\n",
"Hint: use the test set (X_test)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "6ec28171",
"metadata": {},
"outputs": [],
"source": [
"y_predict = "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61d5da41",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"Call the metrics function, passing the true classes of the test set as y_test\n",
"and the predictions obtained from your model as y_predict.\n",
"\"\"\"\n",
"cm_fig, test_score, report = metrics(y_test, y_predict)\n",
"cm_fig.plot(cmap=plt.cm.Blues)"
]
},
{
"cell_type": "markdown",
"id": "a875f0a0",
"metadata": {},
"source": [
"### 6. Conclusions"
]
},
{
"cell_type": "markdown",
"id": "5f619782",
"metadata": {},
"source": [
"Briefly describe the results obtained, including the accuracy and how the model behaves when classifying samples from each of the two classes."
]
},
{
"cell_type": "markdown",
"id": "4f458fbb",
"metadata": {},
"source": [
"\n",
"Write your conclusions here"
]
},
{
"cell_type": "markdown",
"id": "44ca4281",
"metadata": {},
"source": [
"# 2. Regression"
]
},
{
"cell_type": "markdown",
"id": "1bea7949",
"metadata": {},
"source": [
"### Dataset information\n",
"\n",
"https://www.kaggle.com/datasets/gunhee/koreahousedata\n",
"\n",
"### Apartment data\n",
"\n",
"The apartment transaction data were recorded between August 2007 and August 2017 in the Daebong district of Daegu, South Korea.\n"
]
},
{
"cell_type": "code",
"execution_count": 101,
"id": "f9b5e04f",
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv(\"resources/Daegu_Real_Estate_data.csv\")"
]
},
{
"cell_type": "markdown",
"id": "f3913d95",
"metadata": {},
"source": [
"### Task\n",
"\n",
"Predict the price of an apartment"
]
},
{
"cell_type": "markdown",
"id": "70c256b3",
"metadata": {},
"source": [
"### 1. Exploratory data analysis"
]
},
{
"cell_type": "markdown",
"id": "1e914c4b",
"metadata": {},
"source": [
"1. Print the number of records in the dataset\n",
"2. Print the number of variables in the dataset\n",
"3. Print the column names of the dataset\n",
"4. Print the **head** of the dataset\n",
"5. Print the **tail** of the dataset\n",
"6. Print the basic **info** of the dataset\n",
"7. Print a **describe** of the dataset\n",
"8. Create a scatter plot relating Size(sqf) to SalePrice.\n"
]
},
{
"cell_type": "code",
"execution_count": 102,
"id": "94511e24",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of records 5891\n",
"Number of variables 30\n"
]
}
],
"source": [
"print(\"Number of records\", data.shape[0])\n",
"print(\"Number of variables\", data.shape[1])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "adec743c",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "5a9c1762",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "4cc27b54",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc88d985",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a599fe4",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "bbfb9414",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "ec389d38",
"metadata": {},
"source": [
"### 2. Missing-value handling, dataset preparation, and variable encoding"
]
},
{
"cell_type": "markdown",
"id": "5917a1a2",
"metadata": {},
"source": [
"1. Select the target variable (SalePrice) and create a vector named Y that holds it"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06f00ecf",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "b823690f",
"metadata": {},
"source": [
"2. Drop the SalePrice column from the dataset"
]
},
{
"cell_type": "code",
"execution_count": 110,
"id": "eb37fed0",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "4be0d94e",
"metadata": {},
"source": [
"3. Identify the numeric columns so they can be normalized later"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e13f6f8c",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "96f7c676",
"metadata": {},
"source": [
"4. Transform the categorical variables using the pandas get_dummies() function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3323a72c",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "cbf274eb",
"metadata": {},
"source": [
"5. Normalize only the numeric variables identified earlier."
]
},
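{
"cell_type": "markdown",
"id": "d81b3c6e",
"metadata": {},
"source": [
"Steps 3 to 5 can be sketched on a toy frame as follows. The columns `size`, `rooms`, and `heating` and the names `df_demo` and `num_cols_demo` are hypothetical, and `StandardScaler` is assumed as the normalization method because the notebook imports it.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Toy frame with hypothetical columns (not the Daegu data)\n",
"df_demo = pd.DataFrame({\n",
"    'size': [50.0, 80.0, 120.0, 65.0],\n",
"    'rooms': [1.0, 2.0, 4.0, 2.0],\n",
"    'heating': ['gas', 'central', 'gas', 'central'],\n",
"})\n",
"\n",
"# Record the numeric columns before encoding, so the dummy columns stay unscaled\n",
"num_cols_demo = df_demo.select_dtypes(include='number').columns.tolist()\n",
"\n",
"# One-hot encode the categorical columns\n",
"df_demo = pd.get_dummies(df_demo)\n",
"\n",
"# Standardize only the numeric columns recorded above\n",
"df_demo[num_cols_demo] = StandardScaler().fit_transform(df_demo[num_cols_demo])\n",
"\n",
"print(df_demo.head())\n",
"```\n",
"\n",
"Recording the numeric column names before calling get_dummies() keeps the 0/1 indicator columns out of the scaling step."
]
},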
{
"cell_type": "code",
"execution_count": null,
"id": "24a88313",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "92674c0b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "961708e1",
"metadata": {},
"source": [
"6. Print the head of the resulting dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "830a7bab",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "1f44b37f",
"metadata": {},
"source": [
"### 3. Define the training and validation sets\n",
"\n",
"\n",
"1. Create a matrix X containing the features, then split the data into 80% train and 20% test\n",
"2. Print the shape of the training set (X_train)\n",
"3. Print the shape of the test set (X_test)\n",
"\n",
"Hint: use the train_test_split function from sklearn https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fcfe125e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 119,
"id": "cd47fbba",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set dimensions (4712, 46)\n",
"Test set dimensions (1179, 46)\n"
]
}
],
"source": [
"print(\"Training set dimensions\", )\n",
"print(\"Test set dimensions\", )"
]
},
{
"cell_type": "markdown",
"id": "4faa0b54",
"metadata": {},
"source": [
"### 4. Model training\n",
"\n",
"\n",
"1. Create a RandomForestRegressor model using the sklearn library (import it from sklearn.ensemble) https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html\n",
"\n",
"2. Train the model\n",
"\n",
"Hints:\n",
"\n",
"- Use the fit function\n",
"- Use only the training set (X_train, y_train)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "878a71e6",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "0dd38bfa",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "665b5534",
"metadata": {},
"source": [
"### 5. Compute the evaluation metrics\n",
"\n",
"**Note:** This section evaluates the model with regression error metrics (MAE, MAPE, MSE, RMSE, and R2); a confusion matrix does not apply to a continuous target. "
]
},
{
"cell_type": "markdown",
"id": "82e61a22",
"metadata": {},
"source": [
"1. Use the predict() function to create the vector of predictions\n",
"\n",
"\n",
"Hint: use the test set (X_test)"
]
},
{
"cell_type": "code",
"execution_count": 131,
"id": "47d65de7",
"metadata": {},
"outputs": [],
"source": [
"y_predict = "
]
},
{
"cell_type": "markdown",
"id": "a772c4e5",
"metadata": {},
"source": [
"2. Compute the error metrics"
]
},
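{
"cell_type": "markdown",
"id": "e5f07a93",
"metadata": {},
"source": [
"For reference, the error metrics used in this section follow the standard definitions. The notation is chosen here for illustration: $y_i$ is the true value, $\\hat{y}_i$ the prediction, $\\bar{y}$ the mean of the true values, and $n$ the number of test samples.\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\mathrm{MAE} &= \\frac{1}{n}\\sum_{i=1}^{n} \\lvert y_i - \\hat{y}_i \\rvert \\\\\n",
"\\mathrm{MAPE} &= \\frac{1}{n}\\sum_{i=1}^{n} \\left\\lvert \\frac{y_i - \\hat{y}_i}{y_i} \\right\\rvert \\\\\n",
"\\mathrm{MSE} &= \\frac{1}{n}\\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2 \\\\\n",
"\\mathrm{RMSE} &= \\sqrt{\\mathrm{MSE}} \\\\\n",
"R^2 &= 1 - \\frac{\\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2}{\\sum_{i=1}^{n} (y_i - \\bar{y})^2}\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"Written this way, MAPE is a fraction rather than a percentage, and it is undefined when any $y_i$ is zero."
]
},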
{
"cell_type": "code",
"execution_count": 137,
"id": "17691248",
"metadata": {},
"outputs": [],
"source": [
"mae_test = m.mean_absolute_error(y_test, y_predict)\n",
"# MAPE as a fraction (multiply by 100 for a percentage); assumes y_test has no zeros\n",
"mape_test = np.mean(np.abs((y_test - y_predict) / y_test))\n",
"MSE_test = mean_squared_error(y_test, y_predict)\n",
"# Square root of the MSE; avoids the `squared` argument, deprecated and later removed in recent scikit-learn releases\n",
"RMSE_test = np.sqrt(MSE_test)\n",
"R2_test = r2_score(y_test, y_predict)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7549632f",
"metadata": {},
"outputs": [],
"source": [
"print(\"MAE\",mae_test)\n",
"print(\"MAPE\",mape_test)\n",
"print(\"MSE\",MSE_test)\n",
"print(\"RMSE\",RMSE_test)\n",
"print(\"R2\",R2_test)"
]
},
{
"cell_type": "markdown",
"id": "fff6abf1",
"metadata": {},
"source": [
"### 6. Conclusions\n",
"\n",
"Briefly describe the results obtained"
]
},
{
"cell_type": "markdown",
"id": "9db5b8d8",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "9f51468e",
"metadata": {},
"source": [
"Create a scatter plot of y_test against y_predict"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93cb1d78",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}