Introduction

When doing data analysis, it is important to make sure you are using the correct data types; otherwise you may get unexpected results or errors. In many cases, pandas will correctly infer data types and you can move on with your analysis without any further thought on the topic. Despite
how well pandas works, at some point in your data analysis process you will likely need to explicitly convert data from one type to another. This article will discuss the basic pandas data types (aka dtypes), how they map to python and NumPy data types, and the options for converting from one pandas type to another.

Pandas Data Types

A data type is essentially an internal construct that a programming language uses to understand how to store and manipulate data. For instance, a program needs to understand that you can add two numbers together like 5 + 10 to get 15. Or, if you have two strings such as "cat" and "hat" you could concatenate (add) them together to get "cathat". A potentially confusing point about pandas data types is that there is some overlap between pandas, python and NumPy. This table summarizes the key points of the pandas dtype mapping:

| Pandas dtype | Python type | NumPy type | Usage |
|---|---|---|---|
| object | str or mixed | string_, unicode_, mixed types | Text or mixed numeric and non-numeric values |
| int64 | int | int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Integer numbers |
| float64 | float | float_, float16, float32, float64 | Floating point numbers |
| bool | bool | bool_ | True/False values |
| datetime64 | NA | datetime64[ns] | Date and time values |
| timedelta[ns] | NA | NA | Differences between two datetimes |
| category | NA | NA | Finite list of text values |
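To see how these dtypes get assigned in practice, here is a minimal sketch (the sample values are made up for illustration):

```python
import pandas as pd

# pandas infers a dtype for each column or series from its contents
ints = pd.Series([5, 10])        # whole numbers -> int64
floats = pd.Series([5.0, 10.0])  # decimal numbers -> float64
mixed = pd.Series([5, "cat"])    # mixed values fall back to object

print(ints.dtype, floats.dtype, mixed.dtype)
```

Note how the mixed series is stored as object: pandas keeps the original python objects rather than forcing a numeric type.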
For the most part, there is no need to worry about determining if you should try to explicitly force the pandas type to a corresponding NumPy type. Most of the time, the pandas default int64 and float64 types will work. For this article, I will focus on the following pandas types:

- object
- int64
- float64
- bool
- datetime64
The categorical and timedelta types are better served in an article of their own, but the basic approaches outlined here apply to those types as well.

One other item I want to highlight is that the object data type can actually contain multiple different types. For instance, a column could include integers, floats and strings, which collectively are labeled as an object.

Why do we care?

Data types are one of those things that you don't tend to care about until you get an error or some unexpected results. It is also one of the first things you should check once you load new data into pandas for further analysis. I will use a very simple CSV file to illustrate a couple of common errors you might see in pandas if the data type is not correct. Additionally, an example notebook is up on github.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("sales_data_types.csv")
```
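If you don't have the sales_data_types.csv file handy, here is a stand-in dataframe you can build directly to follow along. The values are transcribed from the outputs shown later in this article; the customer names and the exact percent string format are assumptions for illustration:

```python
import pandas as pd

# Hypothetical stand-in for sales_data_types.csv; values reconstructed
# from the article's outputs, names and percent formatting are made up.
df = pd.DataFrame({
    "Customer Number": [10002.0, 552278.0, 23477.0, 24900.0, 651029.0],
    "Customer Name": ["Customer A", "Customer B", "Customer C",
                      "Customer D", "Customer E"],
    "2016": ["$125,000.00", "$920,000.00", "$50,000.00",
             "$350,000.00", "$15,000.00"],
    "2017": ["$162500.00", "$101,2000.00", "$62500.00",
             "$490000.00", "$12750.00"],
    "Percent Growth": ["30.00%", "10.00%", "25.00%", "4.00%", "-15.00%"],
    "Jan Units": ["500", "700", "125", "75", "Closed"],
    "Month": [1, 6, 3, 10, 2],
    "Day": [10, 15, 29, 27, 2],
    "Year": [2015, 2014, 2016, 2015, 2014],
    "Active": ["Y", "Y", "Y", "Y", "N"],
})
print(df.dtypes)
```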
Upon first glance, the data looks ok so we could try doing some operations to analyze the data. Let's try adding together the 2016 and 2017 sales:

```python
df['2016'] + df['2017']
```

```
0    $125,000.00$162500.00
1    $920,000.00$101,2000.00
2    $50,000.00$62500.00
3    $350,000.00$490000.00
4    $15,000.00$12750.00
dtype: object
```

This does not look right. We would like to get the totals added together, but pandas is just concatenating the two values together to create one long string. A clue to the problem is the line that says dtype: object. An object is a string in pandas, so it performs a string operation instead of a mathematical one.

If we want to see what all the data types are in a dataframe, use df.dtypes:

```
Customer Number    float64
Customer Name       object
2016                object
2017                object
Percent Growth      object
Jan Units           object
Month                int64
Day                  int64
Year                 int64
Active              object
dtype: object
```

Additionally, the df.info() function shows even more useful info:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Customer Number    5 non-null float64
Customer Name      5 non-null object
2016               5 non-null object
2017               5 non-null object
Percent Growth     5 non-null object
Jan Units          5 non-null object
Month              5 non-null int64
Day                5 non-null int64
Year               5 non-null int64
Active             5 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 480.0+ bytes
```

After looking at the automatically assigned data types, there are several concerns:

- Customer Number is a float64 but it should be an int64
- The 2016 and 2017 columns are stored as objects, not numeric values such as float64 or int64
- Percent Growth and Jan Units are also stored as objects, not numbers
- The Month, Day and Year columns should be combined into a single datetime64 column
- The Active column should be a boolean
Until we clean up these data types, it is going to be very difficult to do much additional analysis on this data. In order to convert data types in pandas, there are three basic options:

- Use astype() to force an appropriate dtype
- Create a custom function to convert the data
- Use pandas functions such as to_numeric() or to_datetime()
Using the astype() function

The simplest way to convert a pandas column of data to a different type is to use astype(). For instance, to convert the Customer Number to an integer we can call it like this:

```python
df['Customer Number'].astype('int')
```

```
0     10002
1    552278
2     23477
3     24900
4    651029
Name: Customer Number, dtype: int64
```

In order to actually change the customer number in the original dataframe, make sure to
assign it back, since astype() returns a copy:

```python
df["Customer Number"] = df['Customer Number'].astype('int')
df.dtypes
```

```
Customer Number     int64
Customer Name      object
2016               object
2017               object
Percent Growth     object
Jan Units          object
Month               int64
Day                 int64
Year                int64
Active             object
dtype: object
```

And here is the new dataframe with the Customer Number as an integer:
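To make the copy behavior concrete, here is a minimal, self-contained sketch (the series values are made up for illustration):

```python
import pandas as pd

s = pd.Series(["10002", "552278"])  # hypothetical customer numbers stored as strings

converted = s.astype("int")  # astype returns a new, converted copy...
print(s.dtype)               # ...so the original series is still object

s = s.astype("int")          # assign back to actually keep the conversion
print(s.dtype)
```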
This all looks good and seems pretty simple. Let's try to do the same thing to our 2016 column and convert it to a floating point number:

```python
df['2016'].astype('float')
```

```
ValueError                                Traceback (most recent call last)
<ipython-input-45-999869d577b0> in <module>()
----> 1 df['2016'].astype('float')

[lots more code here]

ValueError: could not convert string to float: '$15,000.00'
```

In a similar manner, we can try to convert the Jan Units column:

```python
df['Jan Units'].astype('int')
```

```
ValueError                                Traceback (most recent call last)
<ipython-input-44-31333711e4a4> in <module>()
----> 1 df['Jan Units'].astype('int')

[lots more code here]

ValueError: invalid literal for int() with base 10: 'Closed'
```

Both of these return ValueError exceptions. In each of the cases, the data included values that
could not be interpreted as numbers. In the sales columns, the data includes a currency symbol as well as a comma in each value. In the Jan Units column, the last value is "Closed", which is not a number.

So far it's not looking so good for astype() as a tool. We should give it one more try on the Active column:

```python
df['Active'].astype('bool')
```

```
0    True
1    True
2    True
3    True
4    True
Name: Active, dtype: bool
```

At first glance, this looks ok but upon closer inspection, there is a big problem. All values were interpreted as
True, even though the column included N values that should be False. The reason is that astype('bool') applies python truthiness, and any non-empty string evaluates to True.

The takeaway from this section is that astype() will only work if the data is clean and can be converted simply and directly to a number or string.
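The boolean pitfall described above can be shown with a tiny, self-contained example (the Y/N values are illustrative):

```python
import pandas as pd

flags = pd.Series(["Y", "N", "N"])  # hypothetical yes/no flags stored as strings

# astype('bool') applies python truthiness: every non-empty string is True,
# so the "N" values are silently converted to True as well.
as_bool = flags.astype("bool")
print(as_bool.tolist())

# A comparison produces the intended result instead
correct = flags == "Y"
print(correct.tolist())
```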
If the data has non-numeric characters or is not homogeneous, then astype() will not be a good choice for type conversion. We will need to do additional transforms for the conversion to work properly.

Custom Conversion Functions

Since this data is a little more complex to convert, we can build a custom function that we apply to each value and convert to the appropriate data type. For currency conversion (of this specific data set), here is a simple function we can use:

```python
def convert_currency(val):
    """
    Convert the string number value to a float
     - Remove $
     - Remove commas
     - Convert to float type
    """
    new_val = val.replace(',', '').replace('$', '')
    return float(new_val)
```

The code uses python's string functions to strip out the '$' and ',' and then convert the value to a floating point number. In this specific case, we could convert the values to integers as well, but I'm choosing to use floating point here.

I also suspect that someone will recommend that we use a Decimal type for the currency. This is not a native data type in pandas, so I am intentionally sticking with the float approach.

Also of note is that the function converts the number to a python float, which pandas stores internally as a float64.

Now, we can use the pandas apply function to apply this to all the values in the 2016 column:

```python
df['2016'].apply(convert_currency)
```

```
0    125000.0
1    920000.0
2     50000.0
3    350000.0
4     15000.0
Name: 2016, dtype: float64
```

Success! All the values are showing as float64 so we can do all the math functions we need to.

I'm sure that the more experienced readers are asking why I did not just use a lambda function. Before I answer, here is what we could do in one line with a lambda function:

```python
df['2016'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')
```

Using a lambda is a perfectly valid approach, but I think a named function is easier for new users to read and debug, and it can be reused across multiple columns.
Some may also argue that other lambda-based approaches have performance improvements over the custom function. That may be true, but for the purposes of teaching new users, I think the function approach is preferable.

Here's a full example of converting the data in both sales columns using the convert_currency function:

```python
df['2016'] = df['2016'].apply(convert_currency)
df['2017'] = df['2017'].apply(convert_currency)
df.dtypes
```

```
Customer Number      int64
Customer Name       object
2016               float64
2017               float64
Percent Growth      object
Jan Units           object
Month                int64
Day                  int64
Year                 int64
Active              object
dtype: object
```

For another example of lambda vs. function, we can look at the process for fixing the Percent Growth column. Using a lambda:
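As an aside, the same currency cleanup can be done without apply at all by using pandas' vectorized string methods. This is a sketch of an alternative, not the approach the article takes; the sample values are illustrative:

```python
import pandas as pd

sales = pd.Series(["$125,000.00", "$15,000.00"])  # illustrative currency strings

# Series.str.replace with a regex strips the $ and commas in one pass,
# then astype finishes the conversion to float64.
cleaned = sales.str.replace(r"[\$,]", "", regex=True).astype("float")
print(cleaned)
```

Vectorized string methods tend to be faster than apply on large columns, at the cost of a slightly less beginner-friendly regex.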
```python
df['Percent Growth'].apply(lambda x: x.replace('%', '')).astype('float') / 100
```

Doing the same thing with a custom function:

```python
def convert_percent(val):
    """
    Convert the percentage string to an actual floating point percent
    - Remove %
    - Divide by 100 to make decimal
    """
    new_val = val.replace('%', '')
    return float(new_val) / 100

df['Percent Growth'].apply(convert_percent)
```

Both produce the same value:

```
0    0.30
1    0.10
2    0.25
3    0.04
4   -0.15
Name: Percent Growth, dtype: float64
```

The final custom function I will cover is using np.where() to convert the Active column to a boolean. There are several possible ways to solve this specific problem, but the np.where() approach is useful for many types of problems, so I'm choosing to include it here.

The basic idea is to use the np.where() function to convert all "Y" values to True and everything else to False:

```python
df["Active"] = np.where(df["Active"] == "Y", True, False)
```

Which results in the following dataframe:
The dtype is appropriately set to bool:

```
Customer Number    float64
Customer Name       object
2016                object
2017                object
Percent Growth      object
Jan Units           object
Month                int64
Day                  int64
Year                 int64
Active                bool
dtype: object
```

Whether you choose to use a lambda function, create a more standard python function, or use another approach like np.where(), these approaches are very flexible and can be customized for your own unique data needs.

Pandas helper functions

Pandas has a middle ground between the blunt astype() function and the more complex custom functions. These helper functions can be very useful for certain data type conversions.

If you have been following along, you'll notice that I have not done anything with the date columns or the Jan Units column. Both of these can be converted simply using built-in pandas functions such as pd.to_numeric() and pd.to_datetime().

The reason the Jan Units conversion is problematic is the inclusion of a non-numeric value in the column. If we tried to use astype() we would get an error (as described earlier). The pd.to_numeric() function can handle these values more gracefully:

```python
pd.to_numeric(df['Jan Units'], errors='coerce')
```

```
0    500.0
1    700.0
2    125.0
3     75.0
4      NaN
Name: Jan Units, dtype: float64
```

There are a couple of items of note. First, the function easily processes the data and creates a NaN for the value it cannot convert. Depending on the data set, a NaN may be fine; if not, we can fill in a substitute value using fillna(0):

```python
pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)
```

```
0    500.0
1    700.0
2    125.0
3     75.0
4      0.0
Name: Jan Units, dtype: float64
```

The final conversion I will cover is converting the separate month, day and year
columns into a single datetime using pd.to_datetime():

```python
pd.to_datetime(df[['Month', 'Day', 'Year']])
```

```
0   2015-01-10
1   2014-06-15
2   2016-03-29
3   2015-10-27
4   2014-02-02
dtype: datetime64[ns]
```

In this case, the function combines the columns into a new series of the appropriate datetime64[ns] dtype.

We need to make sure to assign these values back to the dataframe:

```python
df["Start_Date"] = pd.to_datetime(df[['Month', 'Day', 'Year']])
df["Jan Units"] = pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)
```
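Like pd.to_numeric(), pd.to_datetime() also accepts errors='coerce' for messy inputs. Here is a small, self-contained sketch with made-up date strings:

```python
import pandas as pd

# Illustrative date strings; the last one is intentionally invalid
raw = pd.Series(["2015-01-10", "2014-06-15", "not a date"])

# errors='coerce' turns unparseable values into NaT,
# the datetime equivalent of NaN
dates = pd.to_datetime(raw, errors="coerce")
print(dates)
```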
Now the data is properly converted to all the types we need:

```
Customer Number             int64
Customer Name              object
2016                      float64
2017                      float64
Percent Growth            float64
Jan Units                 float64
Month                       int64
Day                         int64
Year                        int64
Active                       bool
Start_Date         datetime64[ns]
```

The dataframe is ready for analysis!

Bringing it all together

The basic concepts of using astype() and custom conversion functions can be applied very early in the data intake process. If you have a data file that you intend to process repeatedly and it always comes in the same format, you can define the dtype and converters to be applied when reading the data.

It is important to note that you can only apply a dtype or a converters function to a specified column once using this approach. If you try to apply both to the same column, the dtype will be skipped.

Here is a streamlined example that does almost all of the conversion at the time the data is read into the dataframe:

```python
df_2 = pd.read_csv("sales_data_types.csv",
                   dtype={'Customer Number': 'int'},
                   converters={'2016': convert_currency,
                               '2017': convert_currency,
                               'Percent Growth': convert_percent,
                               'Jan Units': lambda x: pd.to_numeric(x, errors='coerce'),
                               'Active': lambda x: np.where(x == "Y", True, False)
                               })
df_2.dtypes
```

```
Customer Number      int64
Customer Name       object
2016               float64
2017               float64
Percent Growth     float64
Jan Units          float64
Month                int64
Day                  int64
Year                 int64
Active              object
dtype: object
```

As mentioned earlier, I chose to include a lambda example as well as the function examples for these conversions. Note from the output above that the Active column still comes through as an object with this approach, and the Month, Day and Year columns still need to be combined into a datetime column, so a couple of conversions remain to be done after the data is loaded.
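To try the read-time conversion pattern without the CSV file on disk, a minimal self-contained version can feed an in-memory string to read_csv. The two-row data here is made up for illustration and the converters are inlined so the sketch stands alone:

```python
import io
import pandas as pd

# Hypothetical two-row CSV mimicking the article's layout
data = io.StringIO(
    "Customer Number,2016,Jan Units\n"
    '10002,"$125,000.00",500\n'
    '23477,"$50,000.00",Closed\n'
)

df_demo = pd.read_csv(
    data,
    dtype={"Customer Number": "int"},
    converters={
        # strip $ and commas, then convert to float
        "2016": lambda x: float(x.replace(",", "").replace("$", "")),
        # coerce the non-numeric "Closed" value to NaN
        "Jan Units": lambda x: pd.to_numeric(x, errors="coerce"),
    },
)
print(df_demo.dtypes)
```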
Summary

One of the first steps when exploring a new data set is making sure the data types are set correctly. Pandas makes reasonable inferences most of the time, but there are enough subtleties in data sets that it is important to know how to use the various data conversion options available in pandas. If you have any other tips you have used, or if there is interest in exploring the category data type, feel free to comment below.