{
"cells": [
{
"cell_type": "markdown",
"id": "02d0018f",
"metadata": {},
"source": [
"# Regression and Classification with Random Forests"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6a6cf0c1",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import classification_report,confusion_matrix,ConfusionMatrixDisplay\n",
"import seaborn as sns\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn import model_selection as ms\n",
"from sklearn.metrics import r2_score \n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn import metrics as m"
]
},
{
"cell_type": "markdown",
"id": "ff149f41",
"metadata": {},
"source": [
"# 1. Classification"
]
},
{
"cell_type": "markdown",
"id": "c4e5a4f0",
"metadata": {},
"source": [
"### Dataset information\n",
"\n",
"### Adult income dataset\n",
"\n",
"https://archive.ics.uci.edu/ml/datasets/Adult\n",
"\n",
"\n",
"1. age: continuous.\n",
"2. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.\n",
"3. fnlwgt: continuous.\n",
"4. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.\n",
"5. educational-num: continuous.\n",
"6. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.\n",
"7. occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.\n",
"8. relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.\n",
"9. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.\n",
"10. gender: Female, Male.\n",
"11. capital-gain: continuous.\n",
"12. capital-loss: continuous.\n",
"13. hours-per-week: continuous.\n",
"14. native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.\n",
"15. income (class): >50K, <=50K"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5c686ceb",
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv('resources/adult_.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8f164d20",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" age workclass fnlwgt education educational-num \\\n",
"0 25 Private 226802 11th 7 \n",
"1 38 Private 89814 HS-grad 9 \n",
"2 28 Local-gov 336951 Assoc-acdm 12 \n",
"3 44 Private 160323 Some-college 10 \n",
"4 18 ? 103497 Some-college 10 \n",
"... ... ... ... ... ... \n",
"48837 27 Private 257302 Assoc-acdm 12 \n",
"48838 40 Private 154374 HS-grad 9 \n",
"48839 58 Private 151910 HS-grad 9 \n",
"48840 22 Private 201490 HS-grad 9 \n",
"48841 52 Self-emp-inc 287927 HS-grad 9 \n",
"\n",
" marital-status occupation relationship race gender \\\n",
"0 Never-married Machine-op-inspct Own-child Black Male \n",
"1 Married-civ-spouse Farming-fishing Husband White Male \n",
"2 Married-civ-spouse Protective-serv Husband White Male \n",
"3 Married-civ-spouse Machine-op-inspct Husband Black Male \n",
"4 Never-married ? Own-child White Female \n",
"... ... ... ... ... ... \n",
"48837 Married-civ-spouse Tech-support Wife White Female \n",
"48838 Married-civ-spouse Machine-op-inspct Husband White Male \n",
"48839 Widowed Adm-clerical Unmarried White Female \n",
"48840 Never-married Adm-clerical Own-child White Male \n",
"48841 Married-civ-spouse Exec-managerial Wife White Female \n",
"\n",
" capital-gain capital-loss hours-per-week native-country income \n",
"0 0 0 40 United-States <=50K \n",
"1 0 0 50 United-States <=50K \n",
"2 0 0 40 United-States >50K \n",
"3 7688 0 40 United-States >50K \n",
"4 0 0 30 United-States <=50K \n",
"... ... ... ... ... ... \n",
"48837 0 0 38 United-States <=50K \n",
"48838 0 0 40 United-States >50K \n",
"48839 0 0 40 United-States <=50K \n",
"48840 0 0 20 United-States <=50K \n",
"48841 15024 0 40 United-States >50K \n",
"\n",
"[48842 rows x 15 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "markdown",
"id": "58b44e18",
"metadata": {},
"source": [
"### Task\n",
"\n",
"Predict whether a person's income exceeds $50K/year based on census data. The dataset is also known as the \"Census Income\" dataset.\n",
"\n",
"There are two classes:\n",
"\n",
"1. `<=50K`\n",
"2. `>50K`"
]
},
{
"cell_type": "markdown",
"id": "7e7e50de",
"metadata": {},
"source": [
"### 1. Exploratory data analysis"
]
},
{
"cell_type": "markdown",
"id": "e4b0f2b8",
"metadata": {},
"source": [
"1. Print the number of records in the dataset\n",
"2. Print the number of variables in the dataset\n",
"3. Print the column names of the dataset\n",
"4. Print the **head** of the dataset\n",
"5. Print the **tail** of the dataset\n",
"6. Print the basic **info** of the dataset\n",
"7. Print a **describe** summary of the dataset\n",
"8. Plot the class distribution as a bar chart (recommendation: use the seaborn library)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "dab623a3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of records 48842\n",
"Number of variables 15\n"
]
}
],
"source": [
"print(\"Number of records\", data.shape[0])\n",
"print(\"Number of variables\", data.shape[1])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "545ec6ca",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" age workclass fnlwgt education educational-num marital-status \\\n",
"0 25 Private 226802 11th 7 Never-married \n",
"1 38 Private 89814 HS-grad 9 Married-civ-spouse \n",
"2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse \n",
"3 44 Private 160323 Some-college 10 Married-civ-spouse \n",
"4 18 ? 103497 Some-college 10 Never-married \n",
"\n",
" occupation relationship race gender capital-gain capital-loss \\\n",
"0 Machine-op-inspct Own-child Black Male 0 0 \n",
"1 Farming-fishing Husband White Male 0 0 \n",
"2 Protective-serv Husband White Male 0 0 \n",
"3 Machine-op-inspct Husband Black Male 7688 0 \n",
"4 ? Own-child White Female 0 0 \n",
"\n",
" hours-per-week native-country income \n",
"0 40 United-States <=50K \n",
"1 50 United-States <=50K \n",
"2 40 United-States >50K \n",
"3 40 United-States >50K \n",
"4 30 United-States <=50K "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "9649f77f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" age workclass fnlwgt education educational-num \\\n",
"48837 27 Private 257302 Assoc-acdm 12 \n",
"48838 40 Private 154374 HS-grad 9 \n",
"48839 58 Private 151910 HS-grad 9 \n",
"48840 22 Private 201490 HS-grad 9 \n",
"48841 52 Self-emp-inc 287927 HS-grad 9 \n",
"\n",
" marital-status occupation relationship race gender \\\n",
"48837 Married-civ-spouse Tech-support Wife White Female \n",
"48838 Married-civ-spouse Machine-op-inspct Husband White Male \n",
"48839 Widowed Adm-clerical Unmarried White Female \n",
"48840 Never-married Adm-clerical Own-child White Male \n",
"48841 Married-civ-spouse Exec-managerial Wife White Female \n",
"\n",
" capital-gain capital-loss hours-per-week native-country income \n",
"48837 0 0 38 United-States <=50K \n",
"48838 0 0 40 United-States >50K \n",
"48839 0 0 40 United-States <=50K \n",
"48840 0 0 20 United-States <=50K \n",
"48841 15024 0 40 United-States >50K "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.tail()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f8f8bb9f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 48842 entries, 0 to 48841\n",
"Data columns (total 15 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 age 48842 non-null int64 \n",
" 1 workclass 48842 non-null object\n",
" 2 fnlwgt 48842 non-null int64 \n",
" 3 education 48842 non-null object\n",
" 4 educational-num 48842 non-null int64 \n",
" 5 marital-status 48842 non-null object\n",
" 6 occupation 48842 non-null object\n",
" 7 relationship 48842 non-null object\n",
" 8 race 48842 non-null object\n",
" 9 gender 48842 non-null object\n",
" 10 capital-gain 48842 non-null int64 \n",
" 11 capital-loss 48842 non-null int64 \n",
" 12 hours-per-week 48842 non-null int64 \n",
" 13 native-country 48842 non-null object\n",
" 14 income 48842 non-null object\n",
"dtypes: int64(6), object(9)\n",
"memory usage: 5.6+ MB\n"
]
}
],
"source": [
"data.info()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "9a48bedb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" count mean std min 25% \\\n",
"age 48842.0 38.643585 13.710510 17.0 28.0 \n",
"fnlwgt 48842.0 189664.134597 105604.025423 12285.0 117550.5 \n",
"educational-num 48842.0 10.078089 2.570973 1.0 9.0 \n",
"capital-gain 48842.0 1079.067626 7452.019058 0.0 0.0 \n",
"capital-loss 48842.0 87.502314 403.004552 0.0 0.0 \n",
"hours-per-week 48842.0 40.422382 12.391444 1.0 40.0 \n",
"\n",
" 50% 75% max \n",
"age 37.0 48.0 90.0 \n",
"fnlwgt 178144.5 237642.0 1490400.0 \n",
"educational-num 10.0 12.0 16.0 \n",
"capital-gain 0.0 0.0 99999.0 \n",
"capital-loss 0.0 0.0 4356.0 \n",
"hours-per-week 40.0 45.0 99.0 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.describe().T"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "7b0f44a1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" workclass education marital-status occupation relationship \\\n",
"count 48842 48842 48842 48842 48842 \n",
"unique 9 16 7 15 6 \n",
"top Private HS-grad Married-civ-spouse Prof-specialty Husband \n",
"freq 33906 15784 22379 6172 19716 \n",
"\n",
" race gender native-country income \n",
"count 48842 48842 48842 48842 \n",
"unique 5 2 42 2 \n",
"top White Male United-States <=50K \n",
"freq 41762 32650 43832 37155 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.describe(include='object')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "94c1f279",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['<=50K', '>50K'], dtype=object)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.income.unique()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "69bb63c2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.countplot(x=data.income)"
]
},
{
"cell_type": "markdown",
"id": "4ac9c732",
"metadata": {},
"source": [
"### 2. Handling missing values, preparing the dataset, and encoding variables\n",
"\n",
"1. Replace `<=50K` with 0 and `>50K` with 1"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "ab8933f2",
"metadata": {},
"outputs": [],
"source": [
"data.income = data.income.map({'<=50K': 0, '>50K': 1})"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f4ba3013",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" age workclass fnlwgt education educational-num \\\n",
"0 25 Private 226802 11th 7 \n",
"1 38 Private 89814 HS-grad 9 \n",
"2 28 Local-gov 336951 Assoc-acdm 12 \n",
"3 44 Private 160323 Some-college 10 \n",
"4 18 ? 103497 Some-college 10 \n",
"... ... ... ... ... ... \n",
"48837 27 Private 257302 Assoc-acdm 12 \n",
"48838 40 Private 154374 HS-grad 9 \n",
"48839 58 Private 151910 HS-grad 9 \n",
"48840 22 Private 201490 HS-grad 9 \n",
"48841 52 Self-emp-inc 287927 HS-grad 9 \n",
"\n",
" marital-status occupation relationship race gender \\\n",
"0 Never-married Machine-op-inspct Own-child Black Male \n",
"1 Married-civ-spouse Farming-fishing Husband White Male \n",
"2 Married-civ-spouse Protective-serv Husband White Male \n",
"3 Married-civ-spouse Machine-op-inspct Husband Black Male \n",
"4 Never-married ? Own-child White Female \n",
"... ... ... ... ... ... \n",
"48837 Married-civ-spouse Tech-support Wife White Female \n",
"48838 Married-civ-spouse Machine-op-inspct Husband White Male \n",
"48839 Widowed Adm-clerical Unmarried White Female \n",
"48840 Never-married Adm-clerical Own-child White Male \n",
"48841 Married-civ-spouse Exec-managerial Wife White Female \n",
"\n",
" capital-gain capital-loss hours-per-week native-country income \n",
"0 0 0 40 United-States 0 \n",
"1 0 0 50 United-States 0 \n",
"2 0 0 40 United-States 1 \n",
"3 7688 0 40 United-States 1 \n",
"4 0 0 30 United-States 0 \n",
"... ... ... ... ... ... \n",
"48837 0 0 38 United-States 0 \n",
"48838 0 0 40 United-States 1 \n",
"48839 0 0 40 United-States 0 \n",
"48840 0 0 20 United-States 0 \n",
"48841 15024 0 40 United-States 1 \n",
"\n",
"[48842 rows x 15 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "markdown",
"id": "dcdcc2e7",
"metadata": {},
"source": [
"2. Drop the `income` and `fnlwgt` columns, keeping only the features"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "2deffe45",
"metadata": {},
"outputs": [],
"source": [
"X = data.drop(['income','fnlwgt'],axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "e1ca69ac",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" age workclass education educational-num marital-status \\\n",
"0 25 Private 11th 7 Never-married \n",
"1 38 Private HS-grad 9 Married-civ-spouse \n",
"2 28 Local-gov Assoc-acdm 12 Married-civ-spouse \n",
"3 44 Private Some-college 10 Married-civ-spouse \n",
"4 18 ? Some-college 10 Never-married \n",
"\n",
" occupation relationship race gender capital-gain capital-loss \\\n",
"0 Machine-op-inspct Own-child Black Male 0 0 \n",
"1 Farming-fishing Husband White Male 0 0 \n",
"2 Protective-serv Husband White Male 0 0 \n",
"3 Machine-op-inspct Husband Black Male 7688 0 \n",
"4 ? Own-child White Female 0 0 \n",
"\n",
" hours-per-week native-country \n",
"0 40 United-States \n",
"1 50 United-States \n",
"2 40 United-States \n",
"3 40 United-States \n",
"4 30 United-States "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.head()"
]
},
{
"cell_type": "markdown",
"id": "37613f52",
"metadata": {},
"source": [
"3. Get the names of the numeric columns so they can be standardized later, for example: age, capital-loss, hours-per-week"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "64c88d76",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['age', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week']"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numericalcols=list(X.select_dtypes(exclude='object').columns)\n",
"\n",
"numericalcols"
]
},
{
"cell_type": "markdown",
"id": "f8db313b",
"metadata": {},
"source": [
"4. Use the pandas `get_dummies()` function to one-hot encode the categorical variables, such as workclass, education, etc."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "618a4b91",
"metadata": {},
"outputs": [],
"source": [
"# dtype=int yields 0/1 indicator columns (pandas >= 2.0 defaults to bool)\n",
"X = pd.get_dummies(X, dtype=int)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "724e4271",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" age educational-num capital-gain capital-loss hours-per-week \\\n",
"0 25 7 0 0 40 \n",
"1 38 9 0 0 50 \n",
"2 28 12 0 0 40 \n",
"3 44 10 7688 0 40 \n",
"4 18 10 0 0 30 \n",
"\n",
" workclass_? workclass_Federal-gov workclass_Local-gov \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 1 \n",
"3 0 0 0 \n",
"4 1 0 0 \n",
"\n",
" workclass_Never-worked workclass_Private ... native-country_Portugal \\\n",
"0 0 1 ... 0 \n",
"1 0 1 ... 0 \n",
"2 0 0 ... 0 \n",
"3 0 1 ... 0 \n",
"4 0 0 ... 0 \n",
"\n",
" native-country_Puerto-Rico native-country_Scotland native-country_South \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
" native-country_Taiwan native-country_Thailand \\\n",
"0 0 0 \n",
"1 0 0 \n",
"2 0 0 \n",
"3 0 0 \n",
"4 0 0 \n",
"\n",
" native-country_Trinadad&Tobago native-country_United-States \\\n",
"0 0 1 \n",
"1 0 1 \n",
"2 0 1 \n",
"3 0 1 \n",
"4 0 1 \n",
"\n",
" native-country_Vietnam native-country_Yugoslavia \n",
"0 0 0 \n",
"1 0 0 \n",
"2 0 0 \n",
"3 0 0 \n",
"4 0 0 \n",
"\n",
"[5 rows x 107 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.head()"
]
},
{
"cell_type": "markdown",
"id": "254c963d",
"metadata": {},
"source": [
"5. Standardize the numeric columns using StandardScaler https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "c9ce945e",
"metadata": {},
"outputs": [],
"source": [
"normalizer = StandardScaler()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "b251cd83",
"metadata": {},
"outputs": [],
"source": [
"# Standardize only the numerical columns (zero mean, unit variance)\n",
"X[numericalcols] = normalizer.fit_transform(X[numericalcols])"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "039aa289",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" age educational-num capital-gain capital-loss hours-per-week \\\n",
"0 -0.995129 -1.197259 -0.144804 -0.217127 -0.034087 \n",
"1 -0.046942 -0.419335 -0.144804 -0.217127 0.772930 \n",
"2 -0.776316 0.747550 -0.144804 -0.217127 -0.034087 \n",
"3 0.390683 -0.030373 0.886874 -0.217127 -0.034087 \n",
"4 -1.505691 -0.030373 -0.144804 -0.217127 -0.841104 \n",
"\n",
" workclass_? workclass_Federal-gov workclass_Local-gov \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 1 \n",
"3 0 0 0 \n",
"4 1 0 0 \n",
"\n",
" workclass_Never-worked workclass_Private ... native-country_Portugal \\\n",
"0 0 1 ... 0 \n",
"1 0 1 ... 0 \n",
"2 0 0 ... 0 \n",
"3 0 1 ... 0 \n",
"4 0 0 ... 0 \n",
"\n",
" native-country_Puerto-Rico native-country_Scotland native-country_South \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
" native-country_Taiwan native-country_Thailand \\\n",
"0 0 0 \n",
"1 0 0 \n",
"2 0 0 \n",
"3 0 0 \n",
"4 0 0 \n",
"\n",
" native-country_Trinadad&Tobago native-country_United-States \\\n",
"0 0 1 \n",
"1 0 1 \n",
"2 0 1 \n",
"3 0 1 \n",
"4 0 1 \n",
"\n",
" native-country_Vietnam native-country_Yugoslavia \n",
"0 0 0 \n",
"1 0 0 \n",
"2 0 0 \n",
"3 0 0 \n",
"4 0 0 \n",
"\n",
"[5 rows x 107 columns]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.head()"
]
},
{
"cell_type": "markdown",
"id": "545cd4f3",
"metadata": {},
"source": [
"6. Create the vector Y with the class labels"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "771e83dc",
"metadata": {},
"outputs": [],
"source": [
"Y=data.income"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "aa72dae5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 0\n",
"2 1\n",
"3 1\n",
"4 0\n",
" ..\n",
"48837 0\n",
"48838 1\n",
"48839 0\n",
"48840 0\n",
"48841 1\n",
"Name: income, Length: 48842, dtype: int64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Y"
]
},
{
"cell_type": "markdown",
"id": "153b8a37",
"metadata": {},
"source": [
"### 3. Define the training and validation sets"
]
},
{
"cell_type": "markdown",
"id": "56ad8a80",
"metadata": {},
"source": [
"1. Split the data into 80% train and 20% test.\n",
"2. Print the shape of the training feature matrix (x_train).\n",
"3. Print the shape of the test feature matrix (x_test).\n",
"\n",
"Hint: use the train_test_split function from sklearn https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\n"
]
},
{
"cell_type": "code",
"execution_count": 97,
"id": "1df16920",
"metadata": {},
"outputs": [],
"source": [
"x_train,x_test,y_train,y_test=train_test_split(X,Y,random_state=89,test_size=0.2)"
]
},
{
"cell_type": "code",
"execution_count": 98,
"id": "861c63b1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set dimensions (39073, 107)\n"
]
}
],
"source": [
"print(\"Training set dimensions\", x_train.shape)"
]
},
{
"cell_type": "code",
"execution_count": 99,
"id": "d7f8657d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Test set dimensions (9769, 107)\n"
]
}
],
"source": [
"print(\"Test set dimensions\", x_test.shape)"
]
},
{
"cell_type": "markdown",
"id": "956287ee",
"metadata": {},
"source": [
"### 4. Model training"
]
},
{
"cell_type": "markdown",
"id": "7b18ba2e",
"metadata": {},
"source": [
"1. Create a RandomForestClassifier model using the sklearn library https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html\n",
"2. Train the model\n",
"\n",
"Hints:\n",
"\n",
"- Use the fit function\n",
"- Use only the training set (x_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 100,
"id": "cd1c9d84",
"metadata": {},
"outputs": [],
"source": [
"clf = RandomForestClassifier()"
]
},
{
"cell_type": "code",
"execution_count": 101,
"id": "da33a61b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier()"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf.fit(x_train,y_train)"
]
},
{
"cell_type": "markdown",
"id": "0ab2e1ed",
"metadata": {},
"source": [
"### 5. Compute the evaluation metrics"
]
},
{
"cell_type": "markdown",
"id": "a0af39eb",
"metadata": {},
"source": [
"**Note:** Run the following function, which builds the confusion matrix and computes some metrics."
]
},
{
"cell_type": "code",
"execution_count": 102,
"id": "cd5297c5",
"metadata": {},
"outputs": [],
"source": [
"def metrics(y_true, y_pred):\n",
"    \"\"\"\n",
"    Compute accuracy and a per-class classification report, and build a confusion matrix figure.\n",
"\n",
"    Args:\n",
"        y_true (numpy_array): true classes\n",
"        y_pred (numpy_array): predicted classes\n",
"\n",
"    Returns:\n",
"        cm_fig (ConfusionMatrixDisplay): confusion matrix figure (rows normalized by true class)\n",
"        accuracy (float): accuracy\n",
"        report (dict): per-class precision, recall and F1-score\n",
"\n",
"    \"\"\"\n",
"    cm = confusion_matrix(y_true, y_pred, normalize='true')\n",
"    report = classification_report(y_true, y_pred, output_dict=True)\n",
"    cm_fig = ConfusionMatrixDisplay(confusion_matrix=cm)\n",
"    return cm_fig, report[\"accuracy\"], report"
]
},
{
"cell_type": "markdown",
"id": "1b4bf45e",
"metadata": {},
"source": [
"1. Use the predict() function to create the vector of predictions\n",
"\n",
"Hint: use the test set (x_test)"
]
},
{
"cell_type": "code",
"execution_count": 103,
"id": "6ec28171",
"metadata": {},
"outputs": [],
"source": [
"y_predict = clf.predict(x_test)"
]
},
{
"cell_type": "code",
"execution_count": 104,
"id": "42747dc9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"\"\"\"\n",
"Use the metrics function: pass the test-set classes as y_test\n",
"and your model's predictions as y_predict.\n",
"\"\"\"\n",
"cm_fig,test_score, report = metrics(y_test,y_predict)\n",
"cm_fig.plot(cmap=plt.cm.Blues)"
]
},
{
"cell_type": "markdown",
"id": "a875f0a0",
"metadata": {},
"source": [
"### 6. Conclusions"
]
},
{
"cell_type": "markdown",
"id": "5f619782",
"metadata": {},
"source": [
"Briefly describe the results obtained, including the accuracy and how the model behaves when classifying samples of each class."
]
},
{
"cell_type": "markdown",
"id": "4f458fbb",
"metadata": {},
"source": [
"The model classifies class 0 (income <=50K) well: 92% of those samples are classified correctly. Its behavior on class 1 (income >50K) is poor, since only 62% of those samples are classified correctly. A likely explanation is that the dataset is imbalanced (one class has many more samples than the other). It is therefore advisable to use a stratified validation scheme such as StratifiedKFold."
]
},
{
"cell_type": "markdown",
"id": "44ca4281",
"metadata": {},
"source": [
"# 2. Regression"
]
},
{
"cell_type": "markdown",
"id": "1bea7949",
"metadata": {},
"source": [
"### Dataset information\n",
"\n",
"https://www.kaggle.com/datasets/gunhee/koreahousedata\n",
"\n",
"### Apartment data\n",
"\n",
"The apartment transaction records span August 2007 to August 2017 in the Daebong district, Daegu, South Korea.\n"
]
},
{
"cell_type": "code",
"execution_count": 105,
"id": "f9b5e04f",
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv(\"resources/Daegu_Real_Estate_data.csv\")"
]
},
{
"cell_type": "markdown",
"id": "f3913d95",
"metadata": {},
"source": [
"### Task\n",
"\n",
"Predict the price of an apartment"
]
},
{
"cell_type": "markdown",
"id": "70c256b3",
"metadata": {},
"source": [
"### 1. Exploratory data analysis"
]
},
{
"cell_type": "markdown",
"id": "1e914c4b",
"metadata": {},
"source": [
"1. Print the number of records in the dataset\n",
"2. Print the number of variables in the dataset\n",
"3. Print the column names of the dataset\n",
"4. Print the **head** of the dataset\n",
"5. Print the **tail** of the dataset\n",
"6. Print the basic **info** of the dataset\n",
"7. Print a **describe** of the dataset\n",
"8. Draw a scatter plot relating Size(sqf) to SalePrice.\n"
]
},
{
"cell_type": "code",
"execution_count": 106,
"id": "94511e24",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of records 5891\n",
"Number of variables 30\n"
]
}
],
"source": [
"print(\"Number of records\", data.shape[0])\n",
"print(\"Number of variables\", data.shape[1])"
]
},
{
"cell_type": "code",
"execution_count": 107,
"id": "adec743c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['SalePrice', 'YearBuilt', 'YrSold', 'MonthSold', 'Size(sqf)', 'Floor',\n",
" 'HallwayType', 'HeatingType', 'AptManageType', 'N_Parkinglot(Ground)',\n",
" 'N_Parkinglot(Basement)', 'TimeToBusStop', 'TimeToSubway', 'N_APT',\n",
" 'N_manager', 'N_elevators', 'SubwayStation',\n",
" 'N_FacilitiesNearBy(PublicOffice)', 'N_FacilitiesNearBy(Hospital)',\n",
" 'N_FacilitiesNearBy(Dpartmentstore)', 'N_FacilitiesNearBy(Mall)',\n",
" 'N_FacilitiesNearBy(ETC)', 'N_FacilitiesNearBy(Park)',\n",
" 'N_SchoolNearBy(Elementary)', 'N_SchoolNearBy(Middle)',\n",
" 'N_SchoolNearBy(High)', 'N_SchoolNearBy(University)',\n",
" 'N_FacilitiesInApt', 'N_FacilitiesNearBy(Total)',\n",
" 'N_SchoolNearBy(Total)'],\n",
" dtype='object')"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.columns"
]
},
{
"cell_type": "code",
"execution_count": 108,
"id": "5a9c1762",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" SalePrice YearBuilt YrSold MonthSold Size(sqf) Floor HallwayType \\\n",
"0 141592 2006 2007 8 814 3 terraced \n",
"1 51327 1985 2007 8 587 8 corridor \n",
"2 48672 1985 2007 8 587 6 corridor \n",
"3 380530 2006 2007 8 2056 8 terraced \n",
"4 221238 1993 2007 8 1761 3 mixed \n",
"\n",
" HeatingType AptManageType N_Parkinglot(Ground) ... \\\n",
"0 individual_heating management_in_trust 111.0 ... \n",
"1 individual_heating self_management 80.0 ... \n",
"2 individual_heating self_management 80.0 ... \n",
"3 individual_heating management_in_trust 249.0 ... \n",
"4 individual_heating management_in_trust 523.0 ... \n",
"\n",
" N_FacilitiesNearBy(Mall) N_FacilitiesNearBy(ETC) N_FacilitiesNearBy(Park) \\\n",
"0 1.0 1.0 0.0 \n",
"1 1.0 2.0 1.0 \n",
"2 1.0 2.0 1.0 \n",
"3 1.0 0.0 0.0 \n",
"4 1.0 5.0 0.0 \n",
"\n",
" N_SchoolNearBy(Elementary) N_SchoolNearBy(Middle) N_SchoolNearBy(High) \\\n",
"0 3.0 2.0 2.0 \n",
"1 2.0 1.0 1.0 \n",
"2 2.0 1.0 1.0 \n",
"3 2.0 2.0 1.0 \n",
"4 4.0 3.0 5.0 \n",
"\n",
" N_SchoolNearBy(University) N_FacilitiesInApt N_FacilitiesNearBy(Total) \\\n",
"0 2.0 5 6.0 \n",
"1 0.0 3 12.0 \n",
"2 0.0 3 12.0 \n",
"3 2.0 5 3.0 \n",
"4 5.0 4 14.0 \n",
"\n",
" N_SchoolNearBy(Total) \n",
"0 9.0 \n",
"1 4.0 \n",
"2 4.0 \n",
"3 7.0 \n",
"4 17.0 \n",
"\n",
"[5 rows x 30 columns]"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 109,
"id": "4cc27b54",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" SalePrice YearBuilt YrSold MonthSold Size(sqf) Floor HallwayType \\\n",
"5886 511504 2007 2017 8 1643 19 terraced \n",
"5887 298230 2006 2017 8 903 13 terraced \n",
"5888 357522 2007 2017 8 868 20 terraced \n",
"5889 312389 1978 2017 8 1327 1 corridor \n",
"5890 393805 2007 2017 8 868 13 terraced \n",
"\n",
" HeatingType AptManageType N_Parkinglot(Ground) ... \\\n",
"5886 individual_heating management_in_trust 0.0 ... \n",
"5887 individual_heating management_in_trust 123.0 ... \n",
"5888 individual_heating management_in_trust 0.0 ... \n",
"5889 individual_heating self_management 87.0 ... \n",
"5890 individual_heating management_in_trust 0.0 ... \n",
"\n",
" N_FacilitiesNearBy(Mall) N_FacilitiesNearBy(ETC) \\\n",
"5886 1.0 0.0 \n",
"5887 1.0 2.0 \n",
"5888 1.0 0.0 \n",
"5889 1.0 0.0 \n",
"5890 1.0 0.0 \n",
"\n",
" N_FacilitiesNearBy(Park) N_SchoolNearBy(Elementary) \\\n",
"5886 2.0 3.0 \n",
"5887 0.0 4.0 \n",
"5888 2.0 3.0 \n",
"5889 0.0 3.0 \n",
"5890 2.0 3.0 \n",
"\n",
" N_SchoolNearBy(Middle) N_SchoolNearBy(High) N_SchoolNearBy(University) \\\n",
"5886 3.0 2.0 2.0 \n",
"5887 3.0 3.0 1.0 \n",
"5888 3.0 2.0 2.0 \n",
"5889 3.0 3.0 2.0 \n",
"5890 3.0 2.0 2.0 \n",
"\n",
" N_FacilitiesInApt N_FacilitiesNearBy(Total) N_SchoolNearBy(Total) \n",
"5886 10 9.0 10.0 \n",
"5887 4 8.0 11.0 \n",
"5888 10 9.0 10.0 \n",
"5889 3 7.0 11.0 \n",
"5890 10 9.0 10.0 \n",
"\n",
"[5 rows x 30 columns]"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.tail()"
]
},
{
"cell_type": "code",
"execution_count": 110,
"id": "bc88d985",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 5891 entries, 0 to 5890\n",
"Data columns (total 30 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 SalePrice 5891 non-null int64 \n",
" 1 YearBuilt 5891 non-null int64 \n",
" 2 YrSold 5891 non-null int64 \n",
" 3 MonthSold 5891 non-null int64 \n",
" 4 Size(sqf) 5891 non-null int64 \n",
" 5 Floor 5891 non-null int64 \n",
" 6 HallwayType 5891 non-null object \n",
" 7 HeatingType 5891 non-null object \n",
" 8 AptManageType 5891 non-null object \n",
" 9 N_Parkinglot(Ground) 5891 non-null float64\n",
" 10 N_Parkinglot(Basement) 5891 non-null float64\n",
" 11 TimeToBusStop 5891 non-null object \n",
" 12 TimeToSubway 5891 non-null object \n",
" 13 N_APT 5891 non-null float64\n",
" 14 N_manager 5891 non-null float64\n",
" 15 N_elevators 5891 non-null float64\n",
" 16 SubwayStation 5891 non-null object \n",
" 17 N_FacilitiesNearBy(PublicOffice) 5891 non-null float64\n",
" 18 N_FacilitiesNearBy(Hospital) 5891 non-null int64 \n",
" 19 N_FacilitiesNearBy(Dpartmentstore) 5891 non-null float64\n",
" 20 N_FacilitiesNearBy(Mall) 5891 non-null float64\n",
" 21 N_FacilitiesNearBy(ETC) 5891 non-null float64\n",
" 22 N_FacilitiesNearBy(Park) 5891 non-null float64\n",
" 23 N_SchoolNearBy(Elementary) 5891 non-null float64\n",
" 24 N_SchoolNearBy(Middle) 5891 non-null float64\n",
" 25 N_SchoolNearBy(High) 5891 non-null float64\n",
" 26 N_SchoolNearBy(University) 5891 non-null float64\n",
" 27 N_FacilitiesInApt 5891 non-null int64 \n",
" 28 N_FacilitiesNearBy(Total) 5891 non-null float64\n",
" 29 N_SchoolNearBy(Total) 5891 non-null float64\n",
"dtypes: float64(16), int64(8), object(6)\n",
"memory usage: 1.3+ MB\n"
]
}
],
"source": [
"data.info()"
]
},
{
"cell_type": "code",
"execution_count": 111,
"id": "9a599fe4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" SalePrice YearBuilt YrSold MonthSold Size(sqf) \\\n",
"count 5891.000000 5891.000000 5891.000000 5891.000000 5891.000000 \n",
"mean 221218.112545 2002.967408 2012.691563 6.160244 955.569173 \n",
"std 106384.186446 8.811782 2.905488 3.387752 382.464050 \n",
"min 32743.000000 1978.000000 2007.000000 1.000000 135.000000 \n",
"25% 144247.000000 1993.000000 2010.000000 3.000000 644.000000 \n",
"50% 207964.000000 2006.000000 2013.000000 6.000000 910.000000 \n",
"75% 291150.000000 2008.000000 2015.000000 9.000000 1149.000000 \n",
"max 585840.000000 2015.000000 2017.000000 12.000000 2337.000000 \n",
"\n",
" Floor N_Parkinglot(Ground) N_Parkinglot(Basement) N_APT \\\n",
"count 5891.000000 5891.000000 5891.000000 5891.000000 \n",
"mean 12.026311 195.883551 570.761670 5.613648 \n",
"std 7.548743 218.597210 408.621075 2.811831 \n",
"min 1.000000 0.000000 0.000000 1.000000 \n",
"25% 6.000000 11.000000 184.000000 3.000000 \n",
"50% 11.000000 100.000000 536.000000 7.000000 \n",
"75% 17.000000 249.000000 798.000000 8.000000 \n",
"max 43.000000 713.000000 1321.000000 13.000000 \n",
"\n",
" N_manager ... N_FacilitiesNearBy(Mall) N_FacilitiesNearBy(ETC) \\\n",
"count 5891.000000 ... 5891.000000 5891.000000 \n",
"mean 6.310304 ... 0.941436 1.941266 \n",
"std 3.174088 ... 0.401355 2.201392 \n",
"min 1.000000 ... 0.000000 0.000000 \n",
"25% 5.000000 ... 1.000000 0.000000 \n",
"50% 6.000000 ... 1.000000 1.000000 \n",
"75% 8.000000 ... 1.000000 5.000000 \n",
"max 14.000000 ... 2.000000 5.000000 \n",
"\n",
" N_FacilitiesNearBy(Park) N_SchoolNearBy(Elementary) \\\n",
"count 5891.000000 5891.000000 \n",
"mean 0.654218 3.022407 \n",
"std 0.658320 0.954575 \n",
"min 0.000000 0.000000 \n",
"25% 0.000000 2.000000 \n",
"50% 1.000000 3.000000 \n",
"75% 1.000000 4.000000 \n",
"max 2.000000 6.000000 \n",
"\n",
" N_SchoolNearBy(Middle) N_SchoolNearBy(High) \\\n",
"count 5891.000000 5891.000000 \n",
"mean 2.417756 2.659311 \n",
"std 1.037898 1.556041 \n",
"min 0.000000 0.000000 \n",
"25% 2.000000 1.000000 \n",
"50% 3.000000 2.000000 \n",
"75% 3.000000 4.000000 \n",
"max 4.000000 5.000000 \n",
"\n",
" N_SchoolNearBy(University) N_FacilitiesInApt \\\n",
"count 5891.000000 5891.000000 \n",
"mean 2.764726 5.809540 \n",
"std 1.489289 2.330804 \n",
"min 0.000000 1.000000 \n",
"25% 2.000000 4.000000 \n",
"50% 2.000000 5.000000 \n",
"75% 4.000000 7.000000 \n",
"max 5.000000 10.000000 \n",
"\n",
" N_FacilitiesNearBy(Total) N_SchoolNearBy(Total) \n",
"count 5891.000000 5891.000000 \n",
"mean 9.870820 10.864200 \n",
"std 3.450319 4.438513 \n",
"min 0.000000 0.000000 \n",
"25% 8.000000 7.000000 \n",
"50% 9.000000 10.000000 \n",
"75% 13.000000 15.000000 \n",
"max 16.000000 17.000000 \n",
"\n",
"[8 rows x 24 columns]"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.describe()"
]
},
{
"cell_type": "code",
"execution_count": 112,
"id": "bbfb9414",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"data.plot.scatter(x=\"Size(sqf)\",y=\"SalePrice\")"
]
},
{
"cell_type": "markdown",
"id": "ec389d38",
"metadata": {},
"source": [
"### 2. Missing-value handling, dataset preparation, and variable encoding"
]
},
{
"cell_type": "markdown",
"id": "5917a1a2",
"metadata": {},
"source": [
"1. Select the target variable (SalePrice) and create a vector named Y with its values"
]
},
{
"cell_type": "code",
"execution_count": 113,
"id": "06f00ecf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 141592\n",
"1 51327\n",
"2 48672\n",
"3 380530\n",
"4 221238\n",
" ... \n",
"5886 511504\n",
"5887 298230\n",
"5888 357522\n",
"5889 312389\n",
"5890 393805\n",
"Name: SalePrice, Length: 5891, dtype: int64"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Y = data[\"SalePrice\"]\n",
"Y"
]
},
{
"cell_type": "markdown",
"id": "b823690f",
"metadata": {},
"source": [
"2. Remove the SalePrice column from the dataset"
]
},
{
"cell_type": "code",
"execution_count": 114,
"id": "eb37fed0",
"metadata": {},
"outputs": [],
"source": [
"data = data.drop([\"SalePrice\"],axis=1)"
]
},
{
"cell_type": "markdown",
"id": "4be0d94e",
"metadata": {},
"source": [
"3. Identify the numerical columns so they can be standardized later"
]
},
{
"cell_type": "code",
"execution_count": 115,
"id": "e13f6f8c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['YearBuilt',\n",
" 'YrSold',\n",
" 'MonthSold',\n",
" 'Size(sqf)',\n",
" 'Floor',\n",
" 'N_Parkinglot(Ground)',\n",
" 'N_Parkinglot(Basement)',\n",
" 'N_APT',\n",
" 'N_manager',\n",
" 'N_elevators',\n",
" 'N_FacilitiesNearBy(PublicOffice)',\n",
" 'N_FacilitiesNearBy(Hospital)',\n",
" 'N_FacilitiesNearBy(Dpartmentstore)',\n",
" 'N_FacilitiesNearBy(Mall)',\n",
" 'N_FacilitiesNearBy(ETC)',\n",
" 'N_FacilitiesNearBy(Park)',\n",
" 'N_SchoolNearBy(Elementary)',\n",
" 'N_SchoolNearBy(Middle)',\n",
" 'N_SchoolNearBy(High)',\n",
" 'N_SchoolNearBy(University)',\n",
" 'N_FacilitiesInApt',\n",
" 'N_FacilitiesNearBy(Total)',\n",
" 'N_SchoolNearBy(Total)']"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numericalcols=list(data.select_dtypes(exclude='object').columns)\n",
"\n",
"numericalcols"
]
},
{
"cell_type": "markdown",
"id": "96f7c676",
"metadata": {},
"source": [
"4. Transform the categorical variables using the pandas get_dummies() function"
]
},
{
"cell_type": "code",
"execution_count": 116,
"id": "3323a72c",
"metadata": {},
"outputs": [],
"source": [
"data = pd.get_dummies(data)"
]
},
{
"cell_type": "markdown",
"id": "cbf274eb",
"metadata": {},
"source": [
"5. Standardize only the numerical variables identified earlier."
]
},
{
"cell_type": "code",
"execution_count": 117,
"id": "24a88313",
"metadata": {},
"outputs": [],
"source": [
"normalizer = StandardScaler()"
]
},
{
"cell_type": "code",
"execution_count": 118,
"id": "92674c0b",
"metadata": {},
"outputs": [],
"source": [
"data[numericalcols] = normalizer.fit_transform(data[numericalcols])"
]
},
{
"cell_type": "markdown",
"id": "961708e1",
"metadata": {},
"source": [
"6. Print the head of the resulting dataset"
]
},
{
"cell_type": "code",
"execution_count": 119,
"id": "830a7bab",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" YearBuilt YrSold MonthSold Size(sqf) Floor N_Parkinglot(Ground) \\\n",
"0 0.344181 -1.959067 0.543107 -0.370182 -1.195839 -0.388343 \n",
"1 -2.039194 -1.959067 0.543107 -0.963752 -0.533420 -0.530169 \n",
"2 -2.039194 -1.959067 0.543107 -0.963752 -0.798388 -0.530169 \n",
"3 0.344181 -1.959067 0.543107 2.877458 -0.533420 0.243008 \n",
"4 -1.131242 -1.959067 0.543107 2.106078 -1.195839 1.496562 \n",
"\n",
" N_Parkinglot(Basement) N_APT N_manager N_elevators ... \\\n",
"0 -0.946585 -0.929597 -1.043004 -1.427953 ... \n",
"1 -1.210911 -1.640938 -1.358081 -1.171726 ... \n",
"2 -1.210911 -1.640938 -1.358081 -1.171726 ... \n",
"3 -0.085078 0.137414 -0.412848 -0.018703 ... \n",
"4 -0.085078 0.848755 0.532386 1.134320 ... \n",
"\n",
" TimeToSubway_5min~10min TimeToSubway_no_bus_stop_nearby \\\n",
"0 0 0 \n",
"1 1 0 \n",
"2 1 0 \n",
"3 0 0 \n",
"4 0 0 \n",
"\n",
" SubwayStation_Bangoge SubwayStation_Banwoldang \\\n",
"0 0 0 \n",
"1 0 0 \n",
"2 0 0 \n",
"3 0 0 \n",
"4 0 0 \n",
"\n",
" SubwayStation_Chil-sung-market SubwayStation_Daegu \\\n",
"0 0 0 \n",
"1 0 1 \n",
"2 0 1 \n",
"3 0 0 \n",
"4 0 0 \n",
"\n",
" SubwayStation_Kyungbuk_uni_hospital SubwayStation_Myung-duk \\\n",
"0 1 0 \n",
"1 0 0 \n",
"2 0 0 \n",
"3 0 0 \n",
"4 0 1 \n",
"\n",
" SubwayStation_Sin-nam SubwayStation_no_subway_nearby \n",
"0 0 0 \n",
"1 0 0 \n",
"2 0 0 \n",
"3 1 0 \n",
"4 0 0 \n",
"\n",
"[5 rows x 46 columns]"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "markdown",
"id": "1f44b37f",
"metadata": {},
"source": [
    "### 3. Determine the training and validation sets\n",
    "\n",
    "\n",
    "1. Create a vector X containing the features and split the data 80% train, 20% test\n",
    "2. Print the shape (dimensions) of the training set (x_train)\n",
    "3. Print the shape (dimensions) of the test set (x_test)\n",
    "\n",
    "Hint: use the train_test_split function from sklearn https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\n"
]
},
{
"cell_type": "code",
"execution_count": 120,
"id": "fcfe125e",
"metadata": {},
"outputs": [],
"source": [
"X = data\n",
"x_train,x_test,y_train,y_test=train_test_split(X,Y,random_state=89,test_size=0.2)"
]
},
{
"cell_type": "code",
"execution_count": 121,
"id": "cd47fbba",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
      "Training set dimensions (4712, 46)\n",
      "Test set dimensions (1179, 46)\n"
     ]
    }
   ],
   "source": [
    "print(\"Training set dimensions\", x_train.shape)\n",
    "print(\"Test set dimensions\", x_test.shape)"
]
},
{
"cell_type": "markdown",
"id": "4faa0b54",
"metadata": {},
"source": [
    "### 4. Model training\n",
    "\n",
    "\n",
    "1. Create a RandomForestRegressor model using the sklearn library https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html\n",
    "\n",
    "2. Train the model\n",
    "\n",
    "Hints:\n",
    "\n",
    "- Use the fit function\n",
    "- Use only the training set (x_train, y_train)\n"
]
},
{
"cell_type": "code",
"execution_count": 122,
"id": "878a71e6",
"metadata": {},
"outputs": [],
"source": [
"regressor = RandomForestRegressor()"
]
},
{
"cell_type": "code",
"execution_count": 123,
"id": "0dd38bfa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestRegressor()"
]
},
"execution_count": 123,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"regressor.fit(x_train, y_train)"
]
},
{
"cell_type": "markdown",
"id": "665b5534",
"metadata": {},
"source": [
    "### 5. Compute the evaluation metrics\n",
    "\n",
    "**Note:** This is a regression task, so the cells below compute error metrics (MAE, MAPE, MSE, RMSE and R2) instead of a confusion matrix."
]
},
{
"cell_type": "markdown",
"id": "82e61a22",
"metadata": {},
"source": [
    "1. Use the predict() function to create the vector of predictions\n",
    "\n",
    "\n",
    "Hint: use the test set (x_test)"
]
},
{
"cell_type": "code",
"execution_count": 124,
"id": "47d65de7",
"metadata": {},
"outputs": [],
"source": [
"y_predict = regressor.predict(x_test)"
]
},
{
"cell_type": "markdown",
"id": "a772c4e5",
"metadata": {},
"source": [
    "2. Compute the error metrics"
]
},
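  {
   "cell_type": "markdown",
   "id": "metric-definitions",
   "metadata": {},
   "source": [
    "For reference, a sketch of the standard definitions behind the metrics computed in the next cell. The notation is an assumption introduced here, not part of the original exercise: $y_i$ is the actual value, $\\hat{y}_i$ the prediction, $\\bar{y}$ the mean of the actual values, and $n$ the number of test samples. MAPE is written as a fraction, matching the code, rather than as a percentage.\n",
    "\n",
    "$$\\mathrm{MAE} = \\frac{1}{n}\\sum_{i=1}^{n}\\left|y_i - \\hat{y}_i\\right|, \\qquad \\mathrm{MAPE} = \\frac{1}{n}\\sum_{i=1}^{n}\\left|\\frac{y_i - \\hat{y}_i}{y_i}\\right|$$\n",
    "\n",
    "$$\\mathrm{MSE} = \\frac{1}{n}\\sum_{i=1}^{n}\\left(y_i - \\hat{y}_i\\right)^2, \\qquad \\mathrm{RMSE} = \\sqrt{\\mathrm{MSE}}$$\n",
    "\n",
    "$$R^2 = 1 - \\frac{\\sum_{i=1}^{n}\\left(y_i - \\hat{y}_i\\right)^2}{\\sum_{i=1}^{n}\\left(y_i - \\bar{y}\\right)^2}$$\n",
    "\n",
    "MAE and RMSE are in the units of the target, MSE is in squared units, and MAPE and $R^2$ are unitless. MAPE is undefined when any $y_i$ is zero."
   ]
  },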
{
"cell_type": "code",
"execution_count": 125,
"id": "17691248",
"metadata": {},
"outputs": [],
"source": [
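    "# Error metrics on the held-out test set\n",
    "mae_test = m.mean_absolute_error(y_test, y_predict)\n",
    "# MAPE as a fraction (multiply by 100 for a percentage); assumes y_test has no zeros\n",
    "mape_test = np.mean(np.abs((y_test - y_predict) / y_test))\n",
    "MSE_test = mean_squared_error(y_test, y_predict)\n",
    "# RMSE as the square root of MSE; the squared=False argument is deprecated in recent scikit-learn releases\n",
    "RMSE_test = np.sqrt(MSE_test)\n",
    "R2_test = r2_score(y_test, y_predict)"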
]
},
{
"cell_type": "code",
"execution_count": 126,
"id": "a1bade10",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAE 9310.815854580458\n",
"MAPE 0.04697561991914874\n",
"MSE 213944339.90053365\n",
"RMSE 14626.836291574937\n",
"R2 0.9806543553028587\n"
]
}
],
"source": [
"print(\"MAE\",mae_test)\n",
"print(\"MAPE\",mape_test)\n",
"print(\"MSE\",MSE_test)\n",
"print(\"RMSE\",RMSE_test)\n",
"print(\"R2\",R2_test)"
]
},
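  {
   "cell_type": "markdown",
   "id": "metric-sanity-check",
   "metadata": {},
   "source": [
    "As a quick consistency check, a sketch using only the values printed above (rounded here for readability):\n",
    "\n",
    "$$\\mathrm{RMSE} = \\sqrt{\\mathrm{MSE}} = \\sqrt{213944339.90} \\approx 14626.84, \\qquad \\mathrm{MAPE} \\times 100\\% = 0.04698 \\times 100\\% \\approx 4.7\\%$$\n",
    "\n",
    "Predictions are therefore off by roughly 4.7% of the actual value on average. RMSE being noticeably larger than MAE (14626.84 versus 9310.82) suggests that a few larger errors contribute disproportionately."
   ]
  },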
{
"cell_type": "markdown",
"id": "fff6abf1",
"metadata": {},
"source": [
    "### 6. Conclusions\n",
    "\n",
    "Briefly describe the results obtained"
]
},
{
"cell_type": "markdown",
"id": "9db5b8d8",
"metadata": {},
"source": [
    "The model achieved a Mean Absolute Percentage Error (MAPE) of about 4.7% and an R-squared of 0.98 on the test set. An R-squared this close to 1 means the model explains roughly 98% of the variance in the target, so the predictions track the actual values closely. The model is therefore a good solution to this problem."
]
},
{
"cell_type": "markdown",
"id": "9f51468e",
"metadata": {},
"source": [
    "Create a scatter plot of y_test versus y_predict"
]
},
{
"cell_type": "code",
"execution_count": 127,
"id": "31f44fa7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.scatter(y_test,y_predict)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93cb1d78",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}