Python - Pandas Cheat Sheet

fevereiro 26, 2020

Artigo completo em português: https://danielpms.com.br/artigos/python-pandas-guia-rapido-exploracao-dados.html

Pandas is a Python library for data analysis and data manipulation with a lot of tools that help us to work with databases like MySQL, data based on text files like csv, json, html or Excel files.

Free, powerful and with a low learning curve it's a great tool for beginners on Python language and an alternative to people that use Excel as an analysis tool.

To start with Pandas or learn the basics you don't need to install any software, just access an online tool like Google Colab.

My intention in this post is to show a small cheat sheet with basics of Pandas as a kind of beginner's guide or a consultation guide.

Let's go

To load Pandas library

 import pandas as pd

Importing data from external source

Import text files

  // Regular csv separated by commas
  pd.read_csv('filename')
 
  // Separated by semicolon
  pd.read_csv('filename', sep=';', engine='python')

  // CSV in latin encode
  pd.read_csv('filename', encoding='iso-8859-1')

  // Load just some lines from .csv
  pd.read_csv('filename', nrows=integer with the number of rows)

  // Load file data with custom column names 
  pd.read_csv('filename', names=['Column #1','Column #2'], engine='python')

  // Load data just from some columns (to save memory)
  pd.read_csv('filename', usecols=[0,1,2...])
  pd.read_csv('filename', usecols=['Column #1','Column #2'])

  // Delimited text file tsv
  pd.read_table('filename')

  // Json string, json file or URL
  pd.read_json('json source')

  // Parse HTML as string, file or URL and gets table markups
  pd.read_html('html source')

Import data from excel

 pd.read_excel('filename')
 
 // Read multiple sheets
 file = pd.ExcelFile('filename.xlsx')
 df = pd.read_excel(file, 'Sheet Name')

Import data from SQL

 pd.read_sql(query, connection_configs)

Exporting data

Exports as csv

 df.to_csv('filename')
 
 // Export with latin encode
 df.to_csv('filename', encoding='iso-8859-1')

 // Export CSV without internal index
 df.to_csv('filename', index=False)

 // Export with semicolon delimiter and header
 df.to_csv('filename', sep=';', header='True')

Exports as Excel

 df.to_excel('filename')

 // For a specific sheet
 df.to_excel('path_to_excel_file.xlsx', sheet_name='Sheet Name')

 // Without index column
 df.to_excel('path_to_excel_file.xlsx', sheet_name='Sheet name', index=False)

Exports as Json

 df.to_json('filename')

Exports as HTML table

 df.to_html('filename')

Writes to a table into a SQL database

 df.to_sql(table, connection_configs)

Data structures

Data Frame

Data Frames (two dimensional data structure) looks like a excel sheet.

Creating from a external source

 data = pd.read_csv('file_1.csv')
 df = pd.DataFrame(data)

Manually created

 data = {'name': ['Anne', 'John', 'Beth'], 'age': [23, 18, 86]}
 df = pd.DataFrame(data, columns=['name','age'])

Series

Creating series (one dimensional data structure)

 serie = pd.Series([1,2,'a','b'], index=['one', 'two', 'three', 'four'])

 // With auto index
 autoSerie = pd.Series(['init', 'my', 'serie'])

Combining, Merging or Joining DataFrames

Standard joins

 pd.merge(df1, df2, how='left|right|inner|outer' on='index_column_name')

where df1 and df2 are DataFrames, how is the join method and on is the common field into df1 and df2 as reference to join data.

Internal Pandas index join

 pd.merge(df1, df2, left_index=True, right_index=True, how='left|right|inner|outer')

where left_index and right_index are the Pandas internal index

Select all rows in df1 DataFrame that have a match em df2 Dataframe

 df1[df1['column name'].isin(df2['column name'])]

Select all rows in df1 DataFrame that doesn't match df2 DataFrame rows

 df1[~df1['column name'].isin(df1['column name'])]

Columns unique values

Like SQL 'distinct'

 df.drop_duplicates(['column name'])

Show just some columns

Filter DataFrame with specific column names

 df['column name']
 
 or

 df[['column name 1', 'column name 2']]

Filter DataFrame columns using regular expession

 df.filter(regex='regex expression')

DataFrame Sort

Sort 'column 1' in ascending order and 'column 2' in descending order. Dataframe df was modified with inplace=True.

 df.sort_values(['column 1','column 2'], ascending=[True,False], inplace=True)

Pandas options

Show all columns from a DataFrame

 pd.set_option('display.max_columns', None)
 df.head()

Pesquisar este blog

Bytes Every Day