Pandas DataFrame is two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. Pandas DataFrame consists of three principal components, the data, rows, and columns. Show We will get a brief insight on all these basic operation which can be performed on Pandas DataFrame :
Creating a Pandas DataFrameIn the real world, a Pandas DataFrame will be created by loading the datasets from existing storage, storage can be SQL Database, CSV file, and Excel file. Pandas DataFrame can be created from the lists, dictionary, and from a list of dictionary etc. Dataframe can be created in different ways here are some ways by which we create a dataframe: Creating a dataframe using List: DataFrame can be created using a single list or a list of lists. # import pandas as pd import pandas as pd # list of strings lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks'] # Calling DataFrame constructor on list df = pd.DataFrame(lst) print(df) Output: Creating DataFrame from dict of ndarray/lists: To create DataFrame from dict of narray/list, all the narray must be of same length. If index is passed then the length index should be equal to the length of arrays. If no index is passed, then by default, index will be range(n) where n is the array length. # Python code demonstrate creating # DataFrame from dict narray / lists # By default addresses. import pandas as pd # intialise data of lists. data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]} # Create DataFrame df = pd.DataFrame(data) # Print the output. print(df) Output: Dealing with Rows and ColumnsA Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. We can perform basic operations on rows/columns like selecting, deleting, adding, and renaming. Column Selection: In Order to select a column in Pandas DataFrame, we can either access the columns by calling them by their columns name. # Import pandas package import pandas as pd # Define a dictionary containing employee data data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32], 'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'], 'Qualification':['Msc', 'MA', 'MCA', 'Phd']} # Convert the dictionary into DataFrame df = pd.DataFrame(data) # select two columns print(df[['Name', 'Qualification']]) Output: Row Selection: Pandas provide a unique method to retrieve rows from a Data frame. DataFrame.loc[] method is used to retrieve rows from Pandas DataFrame. Rows can also be selected by passing integer location to an iloc[] function.Note: We’ll be using nba.csv file in below examples.# importing pandas package import pandas as pd # making data frame from csv file data = pd.read_csv("nba.csv", index_col ="Name") # retrieving row by loc method first = data.loc["Avery Bradley"] second = data.loc["R.J. Hunter"] print(first, "\n\n\n", second) Output: For more Details refer to Dealing with Rows and Columns Indexing and Selecting DataIndexing in pandas means simply selecting particular rows and columns of data from a DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns. Indexing can also be known as Subset Selection. Indexing a Dataframe using indexing operator Selecting a single columnsIn order to select a single column, we simply put the name of the column in-between the brackets # importing pandas package import pandas as pd # making data frame from csv file data = pd.read_csv("nba.csv", index_col ="Name") # retrieving columns by indexing operator first = data["Age"] print(first) Output: Indexing a DataFrame using .loc[ ] :This function selects data by the label of the rows and columns. The df.loc indexer selects data in a different way than just the indexing operator. It can select subsets of rows or columns. It can also
simultaneously select subsets of rows and columns.Selecting a single rowIn order to select a single row using # importing pandas package import pandas as pd # making data frame from csv file data = pd.read_csv("nba.csv", index_col ="Name") # retrieving row by loc method first = data.loc["Avery Bradley"] second = data.loc["R.J. Hunter"] print(first, "\n\n\n", second) Output: Indexing a DataFrame using .iloc[ ] :This function allows us to retrieve rows and columns by position. In order to do that, we’ll need to specify the positions of the rows that we want, and the positions of the columns that we want as well. The df.iloc indexer is very similar to df.loc but only uses integer locations to make its selections.Selecting a single rowIn order to select
a single row using import pandas as pd # making data frame from csv file data = pd.read_csv("nba.csv", index_col ="Name") # retrieving rows by iloc method row2 = data.iloc[3] print(row2) Output: For more Details refer
Working with Missing DataMissing Data can occur when no information is provided for one or more items or for a whole unit. Missing Data is a very big problem in real life scenario. Missing Data can also refer to as NA(Not Available) values in pandas. Checking for missing values using # importing pandas as pd import pandas as pd # importing numpy as np import numpy as np # dictionary of lists dict = {'First Score':[100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score':[np.nan, 40, 80, 98]} # creating a dataframe from list df = pd.DataFrame(dict) # using isnull() function df.isnull() Output: Filling missing values using fillna() , replace() and interpolate() : In order to fill null values in a datasets, we use fillna() , replace() and interpolate() function these function replace NaN values with some value of their own. All these function help in filling a null values in datasets of a DataFrame. Interpolate() function is basically used to fill NA values in
the dataframe but it uses various interpolation technique to fill the missing values rather than hard-coding the value.# importing pandas as pd import pandas as pd # importing numpy as np import numpy as np # dictionary of lists dict = {'First Score':[100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score':[np.nan, 40, 80, 98]} # creating a dataframe from dictionary df = pd.DataFrame(dict) # filling missing value using fillna() df.fillna(0) Output: Dropping missing values using dropna() :In order to drop a null values from a dataframe, we used dropna() function this fuction drop Rows/Columns of datasets with Null values in different
ways.# importing pandas as pd import pandas as pd # importing numpy as np import numpy as np # dictionary of lists dict = {'First Score':[100, 90, np.nan, 95], 'Second Score': [30, np.nan, 45, 56], 'Third Score':[52, 40, 80, 98], 'Fourth Score':[np.nan, np.nan, np.nan, 65]} # creating a dataframe from dictionary df = pd.DataFrame(dict) df Now we drop rows with at least one Nan value (Null value) # importing pandas as pd import pandas as pd # importing numpy as np import numpy as np # dictionary of lists dict = {'First Score':[100, 90, np.nan, 95], 'Second Score': [30, np.nan, 45, 56], 'Third Score':[52, 40, 80, 98], 'Fourth Score':[np.nan, np.nan, np.nan, 65]} # creating a dataframe from dictionary df = pd.DataFrame(dict) # using dropna() function df.dropna() Output: For more Details refer to Working with Missing Data in Pandas Iterating over rows and columnsIteration is a general term for taking each item of something, one after another. Pandas DataFrame consists of rows and columns so, in order to iterate over dataframe, we have to iterate a dataframe like a dictionary. Iterating over rows : # importing pandas as pd import pandas as pd # dictionary of lists dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"], 'degree': ["MBA", "BCA", "M.Tech", "MBA"], 'score':[90, 40, 80, 98]} # creating a dataframe from a dictionary df = pd.DataFrame(dict) print(df) Now we apply iterrows() function in order to get a each element of rows.# importing pandas as pd import pandas as pd # dictionary of lists dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"], 'degree': ["MBA", "BCA", "M.Tech", "MBA"], 'score':[90, 40, 80, 98]} # creating a dataframe from a dictionary df = pd.DataFrame(dict) # iterating over rows using iterrows() function for i, j in df.iterrows(): print(i, j) print() Output: Iterating over Columns : In order to iterate over columns, we need to create a list of dataframe columns and then iterating through that list to pull out the dataframe columns. # importing pandas as pd import pandas as pd # dictionary of lists dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"], 'degree': ["MBA", "BCA", "M.Tech", "MBA"], 'score':[90, 40, 80, 98]} # creating a dataframe from a dictionary df = pd.DataFrame(dict) print(df) Now we iterate through columns in order to iterate through columns we first create a list of dataframe columns and then iterate through list. # creating a list of dataframe columns columns = list(df) for i in columns: # printing the third element of the column print (df[i][2]) Output: For more Details refer to Iterating over rows and columns in Pandas DataFrame DataFrame Methods:
What is DataFrame in Python with example?What is a DataFrame? A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with rows and columns.
Why DataFrame is used in Python?Pandas DataFrame is a widely used data structure which works with a two-dimensional array with labeled axes (rows and columns). DataFrame is defined as a standard way to store data that has two different indexes, i.e., row index and column index.
How do you read a DataFrame in Python?Read CSV Files. Load the CSV into a DataFrame: import pandas as pd. df = pd.read_csv('data.csv') ... . Print the DataFrame without the to_string() method: import pandas as pd. ... . Check the number of maximum returned rows: import pandas as pd. ... . Increase the maximum number of rows to display the entire DataFrame: import pandas as pd.. What is the difference between Numpy and DataFrame?Simply speaking, use Numpy array when there are complex mathematical operations to be performed. Use Pandas dataframe for ease of usage of data preprocessing including performing group operations, creation of Matplotlib plots, rows and columns operations.
|