
Note: for the purposes of this article consider the range of the numbers we can assign between 0 and \(+\infty\) with 0 being the smallest number. The question that arises is how do we assign numeric values to text categorical data? The “Position” feature is all text and it is what we will need to convert into model-friendly numeric format. Every row represents a position that an individual holds and the corresponding annual salary. To get a sense how label encoding works, let’s take a look at the following dataset:Īssume it is the data that we would like to feed into some machine learning algorithm.
TEXT ENCODING TRANSLATION AUTCMATIC INSTALL
If you don’t have them installed, please open “Command Prompt” (on Windows) and install them using the following code: To continue following this tutorial we will need the following two Python libraries: sklearn and pandas. We will also outline cases when it should/shouldn’t be applied. In this article we will focus on label encoding and it’s variations. These are valid solutions with their own benefits and costs. And how sensitive it is to the ranges and distributions of numerical features.īoth techniques allow for conversion from categorical/text data to numeric format. There are multiple ways to solve this problem and a lot depends on the algorithm you will be working with. The complication it creates is the fact that machine learning algorithms in fact can work with categorical features, yet they have to be in numeric form. For example, when we work with datasets for salary estimation based on different sets of features, we often see job title being entered in words, for example: Manager, Director, Vice-President, President, and so on. In data science, we often work with datasets that contain categorical variables, where the values are represented by strings. Advantages and disadvantages of label encoding.In this tutorial we will discuss label encoding in Python.
