{ "cells": [ { "cell_type": "markdown", "id": "16be11e8", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "02d0018f", "metadata": {}, "source": [ "# Regression and Classification with Decision Trees" ] }, { "cell_type": "code", "execution_count": 136, "id": "6a6cf0c1", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import classification_report,confusion_matrix,ConfusionMatrixDisplay\n", "import seaborn as sns\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.tree import DecisionTreeRegressor\n", "# Needed for the RandomForestClassifier requested in the model-training step\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn import model_selection as ms\n", "from sklearn.metrics import r2_score\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn import metrics as m" ] }, { "cell_type": "markdown", "id": "ff149f41", "metadata": {}, "source": [ "# 1. Classification" ] }, { "cell_type": "markdown", "id": "c4e5a4f0", "metadata": {}, "source": [ "### Dataset information\n", "\n", "### Adult income dataset\n", "\n", "https://archive.ics.uci.edu/ml/datasets/Adult\n", "\n", "\n", "1. age: continuous.\n", "2. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.\n", "3. fnlwgt: continuous.\n", "4. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.\n", "5. education-num: continuous.\n", "6. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.\n", "7. 
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.\n", "8. relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.\n", "9. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.\n", "10. sex: Female, Male.\n", "11. capital-gain: continuous.\n", "12. capital-loss: continuous.\n", "13. hours-per-week: continuous.\n", "14. native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.\n", "15. class: >50K, <=50K" ] }, { "cell_type": "code", "execution_count": 2, "id": "5c686ceb", "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('resources/adult_.csv')" ] }, { "cell_type": "code", "execution_count": 3, "id": "8f164d20", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age workclass fnlwgt education educational-num \\\n", "0 25 Private 226802 11th 7 \n", "1 38 Private 89814 HS-grad 9 \n", "2 28 Local-gov 336951 Assoc-acdm 12 \n", "3 44 Private 160323 Some-college 10 \n", "4 18 ? 103497 Some-college 10 \n", "... ... ... ... ... ... \n", "48837 27 Private 257302 Assoc-acdm 12 \n", "48838 40 Private 154374 HS-grad 9 \n", "48839 58 Private 151910 HS-grad 9 \n", "48840 22 Private 201490 HS-grad 9 \n", "48841 52 Self-emp-inc 287927 HS-grad 9 \n", "\n", " marital-status occupation relationship race gender \\\n", "0 Never-married Machine-op-inspct Own-child Black Male \n", "1 Married-civ-spouse Farming-fishing Husband White Male \n", "2 Married-civ-spouse Protective-serv Husband White Male \n", "3 Married-civ-spouse Machine-op-inspct Husband Black Male \n", "4 Never-married ? Own-child White Female \n", "... ... ... ... ... ... \n", "48837 Married-civ-spouse Tech-support Wife White Female \n", "48838 Married-civ-spouse Machine-op-inspct Husband White Male \n", "48839 Widowed Adm-clerical Unmarried White Female \n", "48840 Never-married Adm-clerical Own-child White Male \n", "48841 Married-civ-spouse Exec-managerial Wife White Female \n", "\n", " capital-gain capital-loss hours-per-week native-country income \n", "0 0 0 40 United-States <=50K \n", "1 0 0 50 United-States <=50K \n", "2 0 0 40 United-States >50K \n", "3 7688 0 40 United-States >50K \n", "4 0 0 30 United-States <=50K \n", "... ... ... ... ... ... \n", "48837 0 0 38 United-States <=50K \n", "48838 0 0 40 United-States >50K \n", "48839 0 0 40 United-States <=50K \n", "48840 0 0 20 United-States <=50K \n", "48841 15024 0 40 United-States >50K \n", "\n", "[48842 rows x 15 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "markdown", "id": "58b44e18", "metadata": {}, "source": [ "### Tarea\n", "\n", " Predecir si los ingresos exceden los $50K/año según los datos del censo. 
Also known as the \"Census Income\" dataset. \n", " \n", "There are two classes: \n", "\n", "1. <=50K\n", "2. >50K\n", " \n", " " ] }, { "cell_type": "markdown", "id": "7e7e50de", "metadata": {}, "source": [ "### 1. Exploratory data analysis" ] }, { "cell_type": "markdown", "id": "e4b0f2b8", "metadata": {}, "source": [ "1. Print the number of records in the dataset\n", "2. Print the number of variables in the dataset\n", "3. Print the column names of the dataset\n", "4. Print the **head** of the dataset\n", "5. Print the **tail** of the dataset\n", "6. Print the basic **info** of the dataset\n", "7. Print a **describe** of the dataset\n", "8. Plot the class distribution as a bar chart (recommendation: use the seaborn library)." ] }, { "cell_type": "code", "execution_count": 4, "id": "dab623a3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of records 48842\n", "Number of variables 15\n" ] } ], "source": [ "print(\"Number of records\", data.shape[0])\n", "print(\"Number of variables\", data.shape[1])" ] }, { "cell_type": "code", "execution_count": 6, "id": "9649f77f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age workclass fnlwgt education educational-num \\\n", "48837 27 Private 257302 Assoc-acdm 12 \n", "48838 40 Private 154374 HS-grad 9 \n", "48839 58 Private 151910 HS-grad 9 \n", "48840 22 Private 201490 HS-grad 9 \n", "48841 52 Self-emp-inc 287927 HS-grad 9 \n", "\n", " marital-status occupation relationship race gender \\\n", "48837 Married-civ-spouse Tech-support Wife White Female \n", "48838 Married-civ-spouse Machine-op-inspct Husband White Male \n", "48839 Widowed Adm-clerical Unmarried White Female \n", "48840 Never-married Adm-clerical Own-child White Male \n", "48841 Married-civ-spouse Exec-managerial Wife White Female \n", "\n", " capital-gain capital-loss hours-per-week native-country income \n", "48837 0 0 38 United-States <=50K \n", "48838 0 0 40 United-States >50K \n", "48839 0 0 40 United-States <=50K \n", "48840 0 0 20 United-States <=50K \n", "48841 15024 0 40 United-States >50K " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "code", "execution_count": 7, "id": "f8f8bb9f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 48842 entries, 0 to 48841\n", "Data columns (total 15 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 age 48842 non-null int64 \n", " 1 workclass 48842 non-null object\n", " 2 fnlwgt 48842 non-null int64 \n", " 3 education 48842 non-null object\n", " 4 educational-num 48842 non-null int64 \n", " 5 marital-status 48842 non-null object\n", " 6 occupation 48842 non-null object\n", " 7 relationship 48842 non-null object\n", " 8 race 48842 non-null object\n", " 9 gender 48842 non-null object\n", " 10 capital-gain 48842 non-null int64 \n", " 11 capital-loss 48842 non-null int64 \n", " 12 hours-per-week 48842 non-null int64 \n", " 13 native-country 48842 non-null object\n", " 14 income 48842 non-null object\n", "dtypes: int64(6), object(9)\n", "memory 
usage: 5.6+ MB\n" ] } ], "source": [] }, { "cell_type": "code", "execution_count": 8, "id": "9a48bedb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " count mean std min 25% \\\n", "age 48842.0 38.643585 13.710510 17.0 28.0 \n", "fnlwgt 48842.0 189664.134597 105604.025423 12285.0 117550.5 \n", "educational-num 48842.0 10.078089 2.570973 1.0 9.0 \n", "capital-gain 48842.0 1079.067626 7452.019058 0.0 0.0 \n", "capital-loss 48842.0 87.502314 403.004552 0.0 0.0 \n", "hours-per-week 48842.0 40.422382 12.391444 1.0 40.0 \n", "\n", " 50% 75% max \n", "age 37.0 48.0 90.0 \n", "fnlwgt 178144.5 237642.0 1490400.0 \n", "educational-num 10.0 12.0 16.0 \n", "capital-gain 0.0 0.0 99999.0 \n", "capital-loss 0.0 0.0 4356.0 \n", "hours-per-week 40.0 45.0 99.0 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "code", "execution_count": 9, "id": "7b0f44a1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " workclass education marital-status occupation relationship \\\n", "count 48842 48842 48842 48842 48842 \n", "unique 9 16 7 15 6 \n", "top Private HS-grad Married-civ-spouse Prof-specialty Husband \n", "freq 33906 15784 22379 6172 19716 \n", "\n", " race gender native-country income \n", "count 48842 48842 48842 48842 \n", "unique 5 2 42 2 \n", "top White Male United-States <=50K \n", "freq 41762 32650 43832 37155 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "code", "execution_count": 10, "id": "94c1f279", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['<=50K', '>50K'], dtype=object)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "code", "execution_count": 11, "id": "69bb63c2", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\seaborn\\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. 
From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.\n", " warnings.warn(\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZIAAAEGCAYAAABPdROvAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAX+ElEQVR4nO3df7BfdX3n8eeLhCJbBfkRbUygYSTOFqiGJUaq3RkrbUnZ7gY7oNetEm12YhnY0d1OR/APtd3JrMxqU3EKM3ERAquFFKtEB7qyoKuuSLxYJIQfw91CIZJCEISwXdgmvPeP7+eO31y+ublw8r2Xm/t8zJy55/s+53Pu52Qu8+JzPud7TqoKSZJerkNmugOSpNnNIJEkdWKQSJI6MUgkSZ0YJJKkTubPdAem27HHHltLliyZ6W5I0qxyxx13PFFVCwZtm3NBsmTJEkZHR2e6G5I0qyT5+31t89KWJKkTg0SS1IlBIknqxCCRJHVikEiSOjFIJEmdGCSSpE4MEklSJwaJJKmTOffN9gPhtD++eqa7oFegO/7LeTPdBWlGOCKRJHVikEiSOjFIJEmdGCSSpE4MEklSJwaJJKkTg0SS1IlBIknqZGhBkuRVSbYk+XGSbUn+pNU/leQnSe5sy1l9bS5OMpbk/iRn9tVPS7K1bbs0SVr9sCTXtfrtSZYM63wkSYMNc0TyPPCuqnoLsAxYmeT0tm19VS1ry40ASU4CRoCTgZXAZUnmtf0vB9YCS9uystXXAE9V1YnAeuCSIZ6PJGmAoQVJ9TzbPh7alpqkySrg2qp6vqoeBMaAFUkWAkdU1W1VVcDVwNl9bTa29euBM8ZHK5Kk6THUOZIk85LcCTwO3FxVt7dNFya5K8kXkxzVaouAR/qab2+1RW19Yn2vNlW1G3gaOGZAP9YmGU0yunPnzgNzcpIkYMhBUlV7qmoZsJje6OIUepep3kjvctcO4LNt90EjiZqkPlmbif3YUFXLq2r5ggULXtI5SJImNy13bVXVz4BvAyur6rEWMC8AXwBWtN22A8f1NVsMPNrqiwfU92qTZD5wJPDkcM5CkjTIMO/aWpDktW39cOA3gfvanMe4dwN3t/XNwEi7E+sEepPqW6pqB7Aryelt/uM84Ia+Nqvb+jnArW0eRZI0TYb5PpKFwMZ259UhwKaq+kaSa5Iso3cJ6iHgwwBVtS3JJuAeYDdwQVXtacc6H7gKOBy4qS0AVwDXJBmjNxIZGeL5SJIGGFqQVNVdwKkD6h+YpM06YN2A+ihwyoD6c8C53XoqSerCb7ZLkjoxSCRJnRgkkqRODBJJUicGiSSpE4NEktSJQSJJ6sQgkSR1YpBIkjoxSCRJnRgkkqRODBJJUicGiSSpE4NEktSJQSJJ6sQgkSR1YpBIkjoxSCRJnQwtSJK8KsmWJD9Osi3Jn7T60UluTvJA+3lUX5uLk4wluT/JmX3105JsbdsuTZJWPyzJda1+e5IlwzofSdJgwxyRPA+8q6reAiwDViY5HbgIuKWqlgK3tM8kOQkYAU4GVgKXJZnXjnU5sBZY2paVrb4GeKqqTgTWA5cM8XwkSQMMLUiq59n28dC2FLAK2NjqG4Gz2/oq4Nqqer6qHgTGgBVJFgJHVNVtVVXA1RPajB/reuCM8dGKJGl6DHWOJMm8JHc
CjwM3V9XtwOuragdA+/m6tvsi4JG+5ttbbVFbn1jfq01V7QaeBo4ZyslIkgYaapBU1Z6qWgYspje6OGWS3QeNJGqS+mRt9j5wsjbJaJLRnTt37qfXkqSXYlru2qqqnwHfpje38Vi7XEX7+XjbbTtwXF+zxcCjrb54QH2vNknmA0cCTw74/RuqanlVLV+wYMGBOSlJEjDcu7YWJHltWz8c+E3gPmAzsLrtthq4oa1vBkbanVgn0JtU39Iuf+1Kcnqb/zhvQpvxY50D3NrmUSRJ02T+EI+9ENjY7rw6BNhUVd9IchuwKcka4GHgXICq2pZkE3APsBu4oKr2tGOdD1wFHA7c1BaAK4BrkozRG4mMDPF8JEkDDC1Iquou4NQB9Z8CZ+yjzTpg3YD6KPCi+ZWqeo4WRJKkmeE32yVJnRgkkqRODBJJUicGiSSpE4NEktSJQSJJ6sQgkSR1YpBIkjoxSCRJnRgkkqRODBJJUicGiSSpE4NEktSJQSJJ6sQgkSR1YpBIkjoxSCRJnRgkkqRODBJJUidDC5IkxyX5VpJ7k2xL8pFW/1SSnyS5sy1n9bW5OMlYkvuTnNlXPy3J1rbt0iRp9cOSXNfqtydZMqzzkSQNNswRyW7gj6rqV4DTgQuSnNS2ra+qZW25EaBtGwFOBlYClyWZ1/a/HFgLLG3LylZfAzxVVScC64FLhng+kqQBhhYkVbWjqn7U1ncB9wKLJmmyCri2qp6vqgeBMWBFkoXAEVV1W1UVcDVwdl+bjW39euCM8dGKJGl6TMscSbvkdCpweytdmOSuJF9MclSrLQIe6Wu2vdUWtfWJ9b3aVNVu4GngmAG/f22S0SSjO3fuPDAnJUkCpiFIkrwa+Arw0ap6ht5lqjcCy4AdwGfHdx3QvCapT9Zm70LVhqpaXlXLFyxY8NJOQJI0qaEGSZJD6YXIl6rqrwGq6rGq2lNVLwBfAFa03bcDx/U1Xww82uqLB9T3apNkPnAk8ORwzkaSNMgw79oKcAVwb1X9WV99Yd9u7wbubuubgZF2J9YJ9CbVt1TVDmBXktPbMc8Dbuhrs7qtnwPc2uZRJEnTZP4Qj/0O4APA1iR3ttrHgfclWUbvEtRDwIcBqmpbkk3APfTu+Lqgqva0ducDVwGHAze1BXpBdU2SMXojkZEhno8kaYChBUlVfY/Bcxg3TtJmHbBuQH0UOGVA/Tng3A7dlCR15DfbJUmdGCSSpE4MEklSJwaJJKkTg0SS1IlBIknqxCCRJHVikEiSOjFIJEmdGCSSpE4MEklSJwaJJKkTg0SS1IlBIknqZEpBkuSWqdQkSXPPpO8jSfIq4J8BxyY5ip+/X+QI4A1D7pskaRbY34utPgx8lF5o3MHPg+QZ4C+G1y1J0mwxaZBU1eeAzyX591X1+WnqkyRpFpnSq3ar6vNJ3g4s6W9TVVcPqV+SpFliqpPt1wCfAX4deGtblu+nzXFJvpXk3iTbknyk1Y9OcnOSB9rPo/raXJxkLMn9Sc7sq5+WZGvbdmmStPphSa5r9duTLHmp/wCSpG6mNCKhFxonVVW9hGPvBv6oqn6U5DXAHUluBj4I3FJVn05yEXAR8LEkJwEjwMn05mT+R5I3VdUe4HJgLfAD4EZgJXATsAZ4qqpOTDICXAK89yX0UZLU0VS/R3I38Esv5cBVtaOqftTWdwH3AouAVcDGtttG4Oy2vgq4tqqer6oHgTFgRZKFwBFVdVsLsqsntBk/1vXAGeOjFUnS9JjqiORY4J4kW4Dnx4tV9W+m0rhdcjoVuB14fVXtaO13JHld220RvRHHuO2t9k9tfWJ9vM0j7Vi7kzwNHAM8MeH3r6U3ouH444+fSpclSVM01SD51Mv9BUleDXwF+GhVPTPJgGHQhpqkPlmbvQtVG4ANAMuXL38pl+ckSfsx1bu2/ufLOXiSQ+mFyJeq6q9b+bEkC9toZCHweKtvB47ra74YeLTVFw+o97fZnmQ+cCTw5MvpqyTp5ZnqXVu
7kjzTlueS7EnyzH7aBLgCuLeq/qxv02ZgdVtfDdzQVx9pd2KdACwFtrTLYLuSnN6Oed6ENuPHOge49SXeECBJ6miqI5LX9H9OcjawYj/N3gF8ANia5M5W+zjwaWBTkjXAw8C57XdsS7IJuIfeHV8XtDu2AM4HrgIOp3e31k2tfgVwTZIxeiORkamcjyTpwJnqHMlequpr7dbdyfb5HoPnMADO2EebdcC6AfVR4JQB9edoQSRJmhlTCpIkv9f38RB63yvxEpIkacojkn/dt74beIjedzgkSXPcVOdIPjTsjkiSZqep3rW1OMlXkzye5LEkX0myeP8tJUkHu6k+IuVKerfavoHet8m/3mqSpDluqkGyoKqurKrdbbkKWDDEfkmSZompBskTSd6fZF5b3g/8dJgdkyTNDlMNkj8A3gP8A7CD3rfInYCXJE359t//BKyuqqeg93Iqei+6+oNhdUySNDtMdUTy5vEQAaiqJ+k9Fl6SNMdNNUgOmfBK3KN5mY9XkSQdXKYaBp8Fvp/kenqPRnkPA56JJUmae6b6zfark4wC76L3IMbfq6p7htozSdKsMOXLUy04DA9J0l6mOkciSdJABokkqRODRJLUiUEiSerEIJEkdTK0IEnyxfb+krv7ap9K8pMkd7blrL5tFycZS3J/kjP76qcl2dq2XZokrX5Ykuta/fYkS4Z1LpKkfRvmiOQqYOWA+vqqWtaWGwGSnASMACe3Npclmdf2vxxYCyxty/gx1wBPVdWJwHrgkmGdiCRp34YWJFX1HeDJKe6+Cri2qp6vqgeBMWBFkoXAEVV1W1UVcDVwdl+bjW39euCM8dGKJGn6zMQcyYVJ7mqXvsaf37UIeKRvn+2ttqitT6zv1aaqdgNPA8cM+oVJ1iYZTTK6c+fOA3cmkqRpD5LLgTcCy+i91+SzrT5oJFGT1Cdr8+Ji1YaqWl5Vyxcs8MWOknQgTWuQVNVjVbWnql4AvgCsaJu2A8f17boYeLTVFw+o79UmyXzgSKZ+KU2SdIBMa5C0OY9x7wbG7+jaDIy0O7FOoDepvqWqdgC7kpze5j/OA27oa7O6rZ8D3NrmUSRJ02ho7xRJ8pfAO4Fjk2wHPgm8M8kyepegHgI+DFBV25JsovdQyN3ABVW1px3qfHp3gB0O3NQWgCuAa5KM0RuJjAzrXCRJ+za0IKmq9w0oXzHJ/usY8I6TqhoFThlQfw44t0sfJUnd+c12SVInBokkqRODRJLUiUEiSerEIJEkdWKQSJI6MUgkSZ0YJJKkTgwSSVInBokkqRODRJLUiUEiSerEIJEkdTK0p/9Kmn4P/+mvznQX9Ap0/Ce2DvX4jkgkSZ0YJJKkTgwSSVInBokkqRODRJLUydCCJMkXkzye5O6+2tFJbk7yQPt5VN+2i5OMJbk/yZl99dOSbG3bLk2SVj8syXWtfnuSJcM6F0nSvg1zRHIVsHJC7SLglqpaCtzSPpPkJGAEOLm1uSzJvNbmcmAtsLQt48dcAzxVVScC64FLhnYmkqR9GlqQVNV3gCcnlFcBG9v6RuDsvvq1VfV8VT0IjAErkiwEjqiq26qqgKsntBk/1vXAGeOjFUnS9JnuOZLXV9UOgPbzda2+CHikb7/trbaorU+s79WmqnYDTwPHDPqlSdYmGU0yunPnzgN0KpIkeOVMtg8aSdQk9cnavLhYtaGqllfV8gULFrzMLkqSBpnuIHmsXa6i/Xy81bcDx/Xttxh4tNUXD6jv1SbJfOBIXnwpTZI0ZNMdJJuB1W19NXBDX32k3Yl1Ar1J9S3t8teuJKe3+Y/zJrQZP9Y5wK1tHkWSNI2G9tDGJH8JvBM4Nsl24JPAp4FNSdYADwPnAlTVtiSbgHuA3cAFVbWnHep8eneAHQ7c1BaAK4BrkozRG4mMDOtcJEn7NrQgqar37WPTGfvYfx2wbkB9FDhlQP05WhBJkmbOK2WyXZI0SxkkkqRODBJJUicGiSSpE4NEktSJQSJJ6sQgkSR
1YpBIkjoxSCRJnRgkkqRODBJJUicGiSSpE4NEktSJQSJJ6sQgkSR1YpBIkjoxSCRJnRgkkqRODBJJUiczEiRJHkqyNcmdSUZb7egkNyd5oP08qm//i5OMJbk/yZl99dPaccaSXJokM3E+kjSXzeSI5DeqallVLW+fLwJuqaqlwC3tM0lOAkaAk4GVwGVJ5rU2lwNrgaVtWTmN/Zck8cq6tLUK2NjWNwJn99Wvrarnq+pBYAxYkWQhcERV3VZVBVzd10aSNE1mKkgK+GaSO5KsbbXXV9UOgPbzda2+CHikr+32VlvU1ifWXyTJ2iSjSUZ37tx5AE9DkjR/hn7vO6rq0SSvA25Oct8k+w6a96hJ6i8uVm0ANgAsX7584D6SpJdnRkYkVfVo+/k48FVgBfBYu1xF+/l42307cFxf88XAo62+eEBdkjSNpj1IkvxikteMrwO/DdwNbAZWt91WAze09c3ASJLDkpxAb1J9S7v8tSvJ6e1urfP62kiSpslMXNp6PfDVdqfufODLVfU3SX4IbEqyBngYOBegqrYl2QTcA+wGLqiqPe1Y5wNXAYcDN7VFkjSNpj1IqurvgLcMqP8UOGMfbdYB6wbUR4FTDnQfJUlT90q6/VeSNAsZJJKkTgwSSVInBokkqRODRJLUiUEiSerEIJEkdWKQSJI6MUgkSZ0YJJKkTgwSSVInBokkqRODRJLUiUEiSerEIJEkdWKQSJI6MUgkSZ0YJJKkTgwSSVInsz5IkqxMcn+SsSQXzXR/JGmumdVBkmQe8BfA7wAnAe9LctLM9kqS5pZZHSTACmCsqv6uqv4fcC2waob7JElzyvyZ7kBHi4BH+j5vB942cacka4G17eOzSe6fhr7NFccCT8x0J14J8pnVM90F7c2/zXGfzIE4yi/va8NsD5JB/zr1okLVBmDD8Lsz9yQZrarlM90PaSL/NqfPbL+0tR04ru/zYuDRGeqLJM1Jsz1IfggsTXJCkl8ARoDNM9wnSZpTZvWlraraneRC4L8D84AvVtW2Ge7WXOMlQ71S+bc5TVL1oikFSZKmbLZf2pIkzTCDRJLUiUGily3JO5M8neTOtnyib9vAR9ckuSrJOW396CR/m+RDM9F/HTza39WDfX+Ly1o9SS5tf4d3JfkXfW2e7Vs/K8kDSY6fge7PerN6sl0HXrv77dCq+j9TbPLdqvrdCccYf3TNb9G7RfuHSTZX1T19+xxJ7yaJDVV15YHpvQ5WSY6qqqf2s9sfV9X1E2q/Ayxty9uAy5nwpeUkZwCfB367qh4+QF2eUxyRCIAkv5Lks8D9wJs6Hm5/j655NXAT8OWqurzj79LcMJrky0neleSlfE17FXB19fwAeG2SheMbk/xL4AvAv6qq/32A+zxnGCRzWJJfTPKhJN8D/itwL/Dmqvrbtn1936WC/qX/Kcu/luTHSW5KcnKrDXp0zaK+z38GfK+q1g/v7HSQeRPwZeBC4J4kH0/yhgn7rGuXr9YnOazVJvtbPAy4ATi7qu4bYt8Pel7amtt2AHcB/27Qf0hV9R/20/5HwC9X1bNJzgK+Ru8Swv4eXXMrsCrJZ6rq8ZfVc80pVbUH+AbwjSQLgP8MPJzk7VW1BbgY+AfgF+h9f+RjwJ8y+d/iPwHfB9YAHxnuGRzcHJHMbecAPwG+muQTSfZ6KNv+RiRV9UxVPdvWbwQOTXIs+390zbX0rlXfmOQ1wzs9HUySHNkewLqZ3ghlDb3/EaKqdrTLV88DV9K7vAqT/y2+ALwHeGuSj0/DKRy0HJHMYVX1TeCbSY4B3g/ckOQJeiOUh/Y3IknyS8BjVVVJVtD7H5OfAj+jPbqGXlCNAP92wu/+83at+qtJzmpzKdJASf4b8GvAXwHnVdUDE7YvrKodbf7kbODutmkzcGGSa+lNsj9dVTvG21XVPyb5XeC7SR6rqium4XQOOgaJqKqfAp8DPtcCYc8Um54DnJ9kN/B/gZHqPSphSo+uqaqPJbkSuCbJ+6r
qhQNxPjoobQI+WFW797H9S+2SV4A7gT9s9RuBs4Ax4B+BF91qXlVPJlkJfCfJE1V1w4Hu/MHOR6RIkjpxjkSS1IlBIknqxCCRJHVikEiSOjFIJEmdGCTSAZDk+zPdB2mmePuvJKkTRyTSATD+bov2jpZvJ7k+yX1JvjT+tNokb03y/faQyy1JXpPkVUmuTLK1vZvlN9q+H0zytSRfb+/ZuDDJf2z7/CDJ0W2/Nyb5myR3JPlukn8+c/8Kmqv8Zrt04J0KnEzvmU7/C3hHki3AdcB7q+qHSY6g9zSAjwBU1a+2EPhmkvHH+J/SjvUqet/M/lhVnZpkPXAe8Of0HlD4h1X1QJK3AZcB75qm85QAg0Qahi1VtR0gyZ3AEuBpYEdV/RB6D7xs23+d3kuVqKr7kvw9P38fzLeqahewK8nTwNdbfSvw5iSvBt4O/FXfKzrGH58uTRuDRDrwnu9b30Pvv7Ow96P0x032kqb+47zQ9/mFdsxDgJ9V1bKX3VPpAHCORJoe9wFvSPJWgDY/Mh/4DvD7rfYm4Hh6b6ncrzaqeTDJua19krxlGJ2XJmOQSNOgPSb/vcDnk/wYuJne3MdlwLwkW+nNoXywvVNjqn4fWNOOuY29X2ksTQtv/5UkdeKIRJLUiUEiSerEIJEkdWKQSJI6MUgkSZ0YJJKkTgwSSVIn/x890TuUh/5/qQAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "be30ea29", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "a6a7f016", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "dc213050", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5e753c8e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "4bfb74da", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4ac9c732", "metadata": {}, "source": [ "### 2. Missing-value handling, dataset cleanup, and variable encoding\n", "\n", "1. Replace <=50K with 0 and >50K with 1" ] }, { "cell_type": "code", "execution_count": 12, "id": "ab8933f2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f4ba3013", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "dcdcc2e7", "metadata": {}, "source": [ "2. Drop the income and fnlwgt columns so that only the features remain. Keep a copy of income first: it holds the class labels needed for the Y vector in step 6." ] }, { "cell_type": "code", "execution_count": 14, "id": "2deffe45", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e1ca69ac", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "37613f52", "metadata": {}, "source": [ "3. Get the names of the numeric columns so they can be normalized later, for example: age, capital-loss, hours-per-week" ] }, { "cell_type": "code", "execution_count": null, "id": "64c88d76", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "f8db313b", "metadata": {}, "source": [ "4. Use the pandas get_dummies() function to encode the categorical variables, such as workclass, education, etc." 
] }, { "cell_type": "code", "execution_count": null, "id": "618a4b91", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "724e4271", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "254c963d", "metadata": {}, "source": [ "5. Normalize the data using StandardScaler https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html" ] }, { "cell_type": "code", "execution_count": null, "id": "c9ce945e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 20, "id": "b251cd83", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "039aa289", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "545cd4f3", "metadata": {}, "source": [ "6. Create the Y vector with the classes" ] }, { "cell_type": "code", "execution_count": null, "id": "771e83dc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "aa72dae5", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "153b8a37", "metadata": {}, "source": [ "### 3. Define the training and test sets" ] }, { "cell_type": "markdown", "id": "56ad8a80", "metadata": {}, "source": [ "1. Create a matrix X containing the features and split the data into 80% train and 20% test\n", "2. Print the shape (dimensions) of the training set (X_train)\n", "3. 
Print the shape (dimensions) of the test set (X_test)\n", "\n", "Hint: use the train_test_split function from sklearn https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\n" ] }, { "cell_type": "code", "execution_count": 24, "id": "1df16920", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "861c63b1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d7f8657d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "956287ee", "metadata": {}, "source": [ "### 4. Model training" ] }, { "cell_type": "markdown", "id": "7b18ba2e", "metadata": {}, "source": [ "1. Create a RandomForestClassifier model (from sklearn.ensemble) using the sklearn library https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html\n", "2. Train the model\n", "\n", "Hints:\n", "\n", "- Use the fit function\n", "- Use only the training set (X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "id": "cd1c9d84", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "da33a61b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "0ab2e1ed", "metadata": {}, "source": [ "### 5. Compute the evaluation metrics" ] }, { "cell_type": "markdown", "id": "a0af39eb", "metadata": {}, "source": [ "**Note:** Run the following function, which builds the confusion matrix and computes some metrics. 
" ] }, { "cell_type": "code", "execution_count": 29, "id": "cd5297c5", "metadata": {}, "outputs": [], "source": [ "def metrics(y_true, y_pred):\n", "    \"\"\"\n", "    Compute metrics such as accuracy, F1-score and precision, and build a confusion-matrix figure.\n", "\n", "    Args:\n", "        y_true (numpy_array): true classes\n", "        y_pred (numpy_array): predicted classes\n", "\n", "    Returns:\n", "        cm_fig (ConfusionMatrixDisplay): confusion-matrix figure\n", "        accuracy (float): accuracy\n", "        report (dict): per-class and averaged metrics\n", "    \"\"\"\n", "    # normalize='true' scales each row (true class) so that it sums to 1\n", "    cm = confusion_matrix(y_true, y_pred, normalize='true')\n", "    report = classification_report(y_true, y_pred, output_dict=True)\n", "    cm_fig = ConfusionMatrixDisplay(confusion_matrix=cm)\n", "    return cm_fig, report[\"accuracy\"], report" ] }, { "cell_type": "markdown", "id": "1b4bf45e", "metadata": {}, "source": [ "1. Use the predict() function to create the vector of predictions\n", "\n", "Hint: use the test set (X_test)" ] }, { "cell_type": "code", "execution_count": 30, "id": "6ec28171", "metadata": {}, "outputs": [], "source": [ "# `model` is a placeholder name for the classifier trained in step 4\n", "y_predict = model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": null, "id": "61d5da41", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "Use the metrics function: replace y_test with the classes of the test set\n", "and y_predict with the predictions obtained from your model.\n", "\n", "\"\"\"\n", "cm_fig,test_score, report = metrics(y_test,y_predict)\n", "cm_fig.plot(cmap=plt.cm.Blues)" ] }, { "cell_type": "markdown", "id": "a875f0a0", "metadata": {}, "source": [ "### 6. Conclusions" ] }, { "cell_type": "markdown", "id": "5f619782", "metadata": {}, "source": [ "Briefly describe the results obtained, including the accuracy, and comment on how the model behaves when classifying samples from each of the two classes." 
] }, { "cell_type": "markdown", "id": "4f458fbb", "metadata": {}, "source": [ "\n", "Write your conclusions" ] }, { "cell_type": "markdown", "id": "44ca4281", "metadata": {}, "source": [ "# 2. Regression" ] }, { "cell_type": "markdown", "id": "1bea7949", "metadata": {}, "source": [ "### Dataset information\n", "\n", "https://www.kaggle.com/datasets/gunhee/koreahousedata\n", "\n", "### Apartment data\n", "\n", "The apartment transaction data were recorded between August 2007 and August 2017 in the Daebong district of Daegu, South Korea\n" ] }, { "cell_type": "code", "execution_count": 101, "id": "f9b5e04f", "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"resources/Daegu_Real_Estate_data.csv\")" ] }, { "cell_type": "markdown", "id": "f3913d95", "metadata": {}, "source": [ "### Task\n", "\n", "Predict the price of an apartment" ] }, { "cell_type": "markdown", "id": "70c256b3", "metadata": {}, "source": [ "### 1. Exploratory data analysis" ] }, { "cell_type": "markdown", "id": "1e914c4b", "metadata": {}, "source": [ "1. Print the number of records in the dataset\n", "2. Print the number of variables in the dataset\n", "3. Print the column names of the dataset\n", "4. Print the **head** of the dataset\n", "5. Print the **tail** of the dataset\n", "6. Print the basic **info** of the dataset\n", "7. Print a **describe** summary of the dataset\n", "8. 
Make a scatter plot relating the Size(sqf) and the SalePrice of the apartments.\n" ] }, { "cell_type": "code", "execution_count": 102, "id": "94511e24", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of records 5891\n", "Number of variables 30\n" ] } ], "source": [ "print(\"Number of records\",)\n", "print(\"Number of variables\",)" ] }, { "cell_type": "code", "execution_count": null, "id": "adec743c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5a9c1762", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "4cc27b54", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "bc88d985", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "9a599fe4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "bbfb9414", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ec389d38", "metadata": {}, "source": [ "### 2. Missing-value handling, dataset preparation, and variable encoding" ] }, { "cell_type": "markdown", "id": "5917a1a2", "metadata": {}, "source": [ "1. Select the variable to predict (SalePrice) and create a vector named Y with that information" ] }, { "cell_type": "code", "execution_count": null, "id": "06f00ecf", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b823690f", "metadata": {}, "source": [ "2. Drop the SalePrice column from the dataset" ] }, { "cell_type": "code", "execution_count": 110, "id": "eb37fed0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4be0d94e", "metadata": {}, "source": [ "3. 
Identify the numeric columns so they can be normalized later" ] }, { "cell_type": "code", "execution_count": null, "id": "e13f6f8c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "96f7c676", "metadata": {}, "source": [ "4. Transform the categorical variables using the pandas get_dummies() function" ] }, { "cell_type": "code", "execution_count": null, "id": "3323a72c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "cbf274eb", "metadata": {}, "source": [ "5. Normalize only the numeric variables identified earlier." ] }, { "cell_type": "code", "execution_count": null, "id": "24a88313", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "92674c0b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "961708e1", "metadata": {}, "source": [ "6. Print the head of the resulting dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "830a7bab", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1f44b37f", "metadata": {}, "source": [ "### 3. Define the training and test sets\n", "\n", "\n", "1. Create a vector X containing the features, then split the data into 80% train and 20% test\n", "2. Print the shape or dimensions of the training vector (X_train)\n", "3. 
Print the shape or dimensions of the test vector (X_test)\n", "\n", "Hint: use sklearn's train_test_split function https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fcfe125e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 119, "id": "cd47fbba", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training vector dimensions (4712, 46)\n", "Test vector dimensions (1179, 46)\n" ] } ], "source": [ "print(\"Training vector dimensions\", )\n", "print(\"Test vector dimensions\", )" ] }, { "cell_type": "markdown", "id": "4faa0b54", "metadata": {}, "source": [ "### 4. Model training\n", "\n", "\n", "1. Create a RandomForestRegressor model using the sklearn library https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html\n", "\n", "2. Train the model\n", "\n", "Hints:\n", "\n", "- Use the fit function\n", "- Use only the training set (X_train, y_train)\n", "- RandomForestRegressor is not among the imports at the top of this notebook; import it from sklearn.ensemble first\n" ] }, { "cell_type": "code", "execution_count": null, "id": "878a71e6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "0dd38bfa", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "665b5534", "metadata": {}, "source": [ "### 5. Compute the evaluation metrics\n", "\n", "**Note:** A confusion matrix does not apply to regression; evaluate the model with error metrics such as MAE, MAPE, MSE, RMSE and R2." ] }, { "cell_type": "markdown", "id": "82e61a22", "metadata": {}, "source": [ "1. 
Use the predict() function to create the prediction vector\n", "\n", "\n", "Hint: use the test set (X_test)" ] }, { "cell_type": "code", "execution_count": 131, "id": "47d65de7", "metadata": {}, "outputs": [], "source": [ "y_predict = " ] }, { "cell_type": "markdown", "id": "a772c4e5", "metadata": {}, "source": [ "2. Compute the error metrics" ] }, { "cell_type": "code", "execution_count": 137, "id": "17691248", "metadata": {}, "outputs": [], "source": [ "mae_test = m.mean_absolute_error(y_test, y_predict)\n", "# MAPE is returned as a fraction; multiply by 100 for a percentage\n", "mape_test = np.mean(np.abs((y_test - y_predict) / y_test))\n", "MSE_test = mean_squared_error(y_test, y_predict)\n", "# np.sqrt works on every scikit-learn version; the squared=False argument was deprecated in 1.4\n", "RMSE_test = np.sqrt(MSE_test)\n", "R2_test = r2_score(y_test, y_predict)" ] }, { "cell_type": "code", "execution_count": null, "id": "7549632f", "metadata": {}, "outputs": [], "source": [ "print(\"MAE\",mae_test)\n", "print(\"MAPE\",mape_test)\n", "print(\"MSE\",MSE_test)\n", "print(\"RMSE\",RMSE_test)\n", "print(\"R2\",R2_test)" ] }, { "cell_type": "markdown", "id": "fff6abf1", "metadata": {}, "source": [ "### 6. Conclusions\n", "\n", "Briefly describe the results obtained" ] }, { "cell_type": "markdown", "id": "9db5b8d8", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "9f51468e", "metadata": {}, "source": [ "Make a scatter plot of y_test against y_predict" ] }, { "cell_type": "code", "execution_count": null, "id": "93cb1d78", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }