Machine learning cheat sheet GitHub

Cheatsheet collection

A collection of cheatsheets for math, A.I., and Python libraries.
Updated regularly.

A.I. cheatsheets

Deep Learning cheatsheets - Stanford

Author: Afshine Amidi

AI cheatsheet
ML cheatsheet
DL cheatsheet

Machine learning cheatsheet

Author: ghi nhớ

Machine learning cheatsheet

Math cheatsheets

Some visual cheatsheet tools

Activation functions
Distributions
Read here for more details, or try here for a summary implementation
Optimization
Read here for more details

Probability cheatsheet - Harvard's Stat 110

Author: Joe Blitzstein - Professor of Statistics at Harvard

Probability cheatsheet
Book

Linear algebra cheatsheet - Wisconsin-Madison

Author: Laurent Lessard

Linear algebra cheatsheet

Statistics/probability cheatsheets - Stanford

Author: Shervine Amidi

Statistics cheatsheet
Probability cheatsheet

Math for Computer Science - UIT

Author: Ngọc-Hoàng Lương

Math for Computer Science

Python cheatsheets

Library cheatsheets

Keras
Matplotlib
NumPy
pandas
SciPy
scikit-learn

PyTorch cheatsheets

PyTorch cheatsheet
PyTorch tutorial

Assuming we have a dataset in a loadable format (e.g., CSV), these are the steps we follow to complete a machine learning project: exploratory data analysis, preprocessing, feature engineering, and machine learning.

A few notes before we proceed:

First, machine learning is a highly iterative field. It requires cycling through the steps above, with each cycle building on feedback from the previous one, with the goal of improving model performance. For example, we need to refit models whenever we engineer new features and test whether those features are predictive.

Second, while in Kaggle competitions one can build a monster ensemble of models, in production systems such ensembles are often not useful. They are high-maintenance, hard to interpret, and too complex to deploy. This is why, in practice, a simpler model plus a huge amount of data often wins.

Third, while some code snippets are reusable, each dataset has its own uniqueness. Dataset-specific effort is needed to build better models.

With these points in mind, let's get our hands dirty.

Exploratory data analysis

Exploratory data analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often with plots. The goal of EDA is to gain a deeper understanding of the dataset so that we can preprocess the data and engineer features more effectively. Below are some generic code snippets that can be applied to any structured dataset.

Import libraries

import os
import fnmatch
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Data I/O

df = pd.read_csv(file_path) # read in csv file as a DataFrame
df.to_csv(file_path, index=False) # save a DataFrame as csv file

# read all csv under a folder and concatenate them into a big dataframe
path = r'path'

# flat
all_files = glob.glob(os.path.join(path, "*.csv"))

# or recursively
all_files = [os.path.join(root, filename)
             for root, dirnames, filenames in os.walk(path)
             for filename in fnmatch.filter(filenames, '*.csv')]
             
df = pd.concat((pd.read_csv(f) for f in all_files))

Compressed data I/O

import pandas as pd
import zipfile

zf_path = 'file.zip'
zf = zipfile.ZipFile(zf_path) # zipfile.ZipFile object
all_files = zf.namelist() # list all zipped files
all_files = [f for f in all_files if f.endswith('.csv')] # e.g., get only csv
df = pd.concat((pd.read_csv(zf.open(f)) for f in all_files)) # concat all zipped csv into one dataframe

Write to a table in a sqlite3 DB (you can then use DB Browser for SQLite to view and query the table)

import sqlite3
import pandas as pd

df = pd.read_csv(csv_file) # read csv file
sqlite_file = 'my_db.sqlite3'
conn = sqlite3.connect(sqlite_file) # establish a sqlite3 connection

# if db file exists append the csv
df.to_sql(tablename, conn, if_exists='append', index=False)

Data summary

df.head() # return the first 5 rows
df.describe() # summary statistics, excluding NaN values
df.info(verbose=True, show_counts=True) # concise summary of the table (show_counts replaces the removed null_counts)
df.shape # shape of dataset
df.skew(numeric_only=True) # skewness for numeric columns
df.kurt(numeric_only=True) # unbiased kurtosis for numeric columns
df.dtypes.value_counts() # counts of dtypes (get_dtype_counts was removed in pandas 1.0)

Show the proportion of missing values for each column

for c in df.columns:
  num_na = df[c].isnull().sum()
  if num_na > 0:
    print(round(num_na / len(df), 3), '|', c)

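Pairwise correlation of columns

df.corr(numeric_only=True) # numeric_only requires pandas >= 1.5

Plots

Heatmap of the correlation matrix (of all numeric columns)

num_df = df.select_dtypes(include='number') # keep numeric columns only
cm = np.corrcoef(num_df.T)
sns.heatmap(cm, annot=True, yticklabels=num_df.columns, xticklabels=num_df.columns)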

Univariate distribution plots
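
A minimal sketch of this step, assuming a small synthetic DataFrame (the column names `a`, `b`, and `c` are made up) and using pandas' built-in `DataFrame.hist` rather than any particular seaborn call:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data; column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=200),
    "b": rng.exponential(size=200),
    "c": rng.integers(0, 10, size=200),
})

# One histogram per numeric column; NaN values are ignored.
axes = df.select_dtypes(include="number").hist(bins=20, figsize=(8, 6))
plt.tight_layout()
# Call plt.show() in an interactive session to display the figure.
```

The number of bins is a judgment call; trying a few values often reveals structure that a single setting hides.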

Kernel density estimation (KDE) plots

# all continuous variables
for c in df.columns:
  if df[c].dtype in ['float64']:
    sns.kdeplot(df[c].dropna(), fill=True) # fill replaces the deprecated shade argument
    plt.show()

Plot pairwise relationships

hypertools is a Python toolbox for visualizing and manipulating high-dimensional data, which makes it useful during the EDA phase.

Visually explore the relationship between features and target (in 3D space)

Linear regression analysis using each principal component (PC)

Break down by labels

For more hypertools use cases, check out its notebooks and examples.

Preprocessing

Drop columns
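
A minimal sketch of dropping columns with `DataFrame.drop`, assuming a toy frame whose column names (`id`, `age`, `note`) are made up for illustration:

```python
import pandas as pd

# Toy frame; the column names are hypothetical.
df = pd.DataFrame({"id": [1, 2, 3], "age": [31, 45, 27], "note": ["x", "y", "z"]})

# Drop columns by name; errors="ignore" skips names that are not present.
df = df.drop(columns=["id", "note", "not_a_column"], errors="ignore")

print(df.columns.tolist())  # ['age']
```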

Handle missing values
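
Two common options are dropping incomplete rows and imputing. The sketch below shows both on made-up data; the choice of median and most-frequent value as fill values is an assumption, not a recommendation for every dataset:

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column; names and values are hypothetical.
df = pd.DataFrame({
    "age": [31.0, np.nan, 27.0, 40.0],
    "city": ["Hanoi", None, "Hue", "Hanoi"],
})

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: impute. Median for the numeric column, most frequent value for the categorical one.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
```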

Encode categorical features
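
One simple approach, sketched here on a hypothetical `color` column, is to map each category to an integer code through the pandas `category` dtype:

```python
import pandas as pd

# Toy categorical column; the name and values are hypothetical.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Map each category to an integer code; categories are sorted alphabetically,
# so blue -> 0, green -> 1, red -> 2.
df["color_code"] = df["color"].astype("category").cat.codes
```

Integer codes imply an ordering that may not exist in the data, which tree-based models usually tolerate and linear models usually do not.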

Join two tables/dataframes
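
A minimal sketch using `pd.merge`, assuming two made-up tables that share a `user_id` key:

```python
import pandas as pd

# Hypothetical tables sharing the key column `user_id`.
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["An", "Binh", "Chi"]})
orders = pd.DataFrame({"user_id": [1, 1, 3, 4], "amount": [10.0, 5.0, 7.5, 2.0]})

# Inner join keeps only keys present in both tables.
inner = pd.merge(users, orders, on="user_id", how="inner")

# Left join keeps every row of `users`; unmatched rows get NaN in `amount`.
left = pd.merge(users, orders, on="user_id", how="left")
```

Checking the row count before and after a join is a cheap way to catch unexpected duplicate keys.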

Handle outliers (outliers can be either clipped or removed. WARNING: outliers are not always meant to be removed)

In the following examples, we assume df is all numeric and has no missing values.

Clipping
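
A minimal sketch of clipping, assuming synthetic all-numeric data and per-column 1st/99th percentile bounds (the percentiles are an assumed choice):

```python
import numpy as np
import pandas as pd

# Synthetic all-numeric data without missing values.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])

# Per-column 1st and 99th percentiles (assumed cut-offs; tune them per dataset).
lower = df.quantile(0.01)
upper = df.quantile(0.99)

# Values outside [lower, upper] are replaced by the nearest bound, column by column.
clipped = df.clip(lower=lower, upper=upper, axis=1)
```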

Removal
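
A minimal sketch of removal based on z-scores, assuming synthetic all-numeric data with one planted outlier. The threshold of 3 standard deviations is a common convention, not a rule:

```python
import numpy as np
import pandas as pd

# Synthetic all-numeric data with one planted outlier in row 0.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])
df.loc[0, "a"] = 50.0

# Z-score of every cell, then keep rows where all columns are within 3 standard deviations.
z = (df - df.mean()) / df.std()
kept = df[(z.abs() < 3).all(axis=1)]
```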

Filter
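
A minimal sketch of row filtering with a boolean mask and with `DataFrame.query`, on made-up columns `age` and `city`:

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({"age": [31, 45, 27, 60], "city": ["Hanoi", "Hue", "Hanoi", "Hue"]})

# Boolean mask: combine conditions with & (and) / | (or), each wrapped in parentheses.
mask = (df["age"] > 30) & (df["city"] == "Hue")
filtered = df[mask]

# The same filter written with DataFrame.query.
filtered_q = df.query("age > 30 and city == 'Hue'")
```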

Feature engineering

Transformation

One-hot encode categorical features
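
A minimal sketch using `pd.get_dummies`, assuming a toy frame with one categorical column (`color`) and one numeric column (`size`):

```python
import pandas as pd

# Toy frame; column names and values are hypothetical.
df = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1, 2, 3]})

# Replace `color` with one indicator column per category; other columns pass through untouched.
encoded = pd.get_dummies(df, columns=["color"])

print(encoded.columns.tolist())  # ['size', 'color_blue', 'color_green', 'color_red']
```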

Normalize numeric features (to the range [0, 1])
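
Min-max scaling maps each column to [0, 1] via (x - min) / (max - min). A minimal sketch in plain pandas on made-up data:

```python
import pandas as pd

# Toy numeric data; names and values are hypothetical.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})

# Min-max scaling per column: (x - min) / (max - min).
# A constant column would divide by zero; handle that case separately.
normalized = (df - df.min()) / (df.max() - df.min())
```

In a modeling workflow, the minimum and maximum would normally be computed on the training set only and then reused for the dev and test sets.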

Log transformation: for columns with highly skewed distributions, we can apply a log transformation
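
A minimal sketch, assuming synthetic data and a skewness cut-off of 1.0 (the cut-off and the column names `income` and `height` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic data: one right-skewed column, one roughly symmetric column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=1000),
    "height": rng.normal(loc=170, scale=10, size=1000),
})

# Pick columns whose skewness exceeds the assumed threshold.
skewness = df.skew()
skewed_cols = skewness[skewness > 1.0].index

# log1p(x) = log(1 + x) is defined at zero; it still requires x > -1.
transformed = df.copy()
transformed[skewed_cols] = np.log1p(transformed[skewed_cols])
```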

Creation

Feature creation is both a domain and an engineering effort. With help from domain experts, we can create more predictive features, but here are some generic feature creation methods worth trying on any structured dataset.

Add feature: number of missing values
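
A minimal sketch on made-up data; the new column name `n_missing` is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy data with missing values.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# Count missing values per row.
df["n_missing"] = df.isnull().sum(axis=1)
```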

Add feature: number of zeros
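
A minimal sketch on made-up data; the new column name `n_zeros` is hypothetical:

```python
import pandas as pd

# Toy data containing zeros.
df = pd.DataFrame({"a": [0, 2, 0], "b": [0, 5, 6]})

# Count zeros per row across the original feature columns.
df["n_zeros"] = (df == 0).sum(axis=1)
```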

Add feature: a binary value for each feature indicating whether a data point is null
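
A minimal sketch on made-up data; the `_is_null` suffix is a hypothetical naming choice:

```python
import numpy as np
import pandas as pd

# Toy data with missing values.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# One 0/1 indicator per original column, appended to the frame.
indicators = df.isnull().astype(int).add_suffix("_is_null")
df = pd.concat([df, indicators], axis=1)
```

These indicators are best created before imputation, since filling the gaps erases the information they capture.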

Add feature interactions
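
A minimal sketch that adds the product of every pair of columns, assuming three made-up numeric columns and a hypothetical `_x_` naming scheme:

```python
from itertools import combinations

import pandas as pd

# Toy numeric data; column names are hypothetical.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Add the product of every pair of original columns (a*b, a*c, b*c).
for left, right in combinations(["a", "b", "c"], 2):
    df[f"{left}_x_{right}"] = df[left] * df[right]
```

scikit-learn's `PolynomialFeatures(interaction_only=True)` produces the same kind of terms for array inputs. The number of pairs grows quadratically with the number of columns, so it may be worth restricting the loop to a shortlist of features.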

Selection

There are various ways of selecting features, and one effective way is recursive feature elimination (RFE).

Select features using RFE
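
A minimal sketch using scikit-learn's `RFE`, assuming synthetic data, a logistic regression as the base estimator, and a target of 3 features (all three are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, 3 of them informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# RFE repeatedly fits the estimator and drops the weakest feature (step=1)
# until n_features_to_select remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
selector.fit(X, y)

selected = selector.support_        # boolean mask of kept features
X_selected = selector.transform(X)  # array restricted to the kept features
```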

For more feature engineering methods, please refer to this blogpost.

Machine learning

Cross-validation (CV) strategy

Theory first (some of it adopted from Andrew Ng). In machine learning, we usually have the following subsets of data:

training set: used to run the learning algorithm on
dev set (or hold-out cross-validation set): used to tune parameters, select features, and make other decisions regarding the learning algorithm
test set: used to evaluate the performance of the algorithms, but NOT to make any decisions about which algorithm or parameters to use

Ideally, those 3 sets should come from the same distribution and reflect the data you expect to get in the future and want to do well on.

If we have a real-world application from which we continuously collect new data, we can train on historical data and split the incoming data into dev and test sets. This is beyond the scope of this cheat sheet. The following example assumes we have a CSV file and want to train the best model on this snapshot.

How should we split the three sets?

training set: the larger the better
dev set: should be large enough to detect differences between algorithms (e.g., if classifier A has 90% accuracy and classifier B has 90.1%, a dev set of 100 examples cannot detect this 0.1% difference; something around 1,000 to 10,000 examples will do)
test set: should be large enough to give high confidence in the overall performance of the system (do not naively use 30% of the data)

Sometimes we are quite data-constrained (e.g., 1,000 data points), and a compromise strategy is 70%/15%/15% for the training/dev/test sets.
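
The 70%/15%/15% split can be sketched with two calls to scikit-learn's `train_test_split`. The arrays and the seed value below are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 points with 2 features and alternating binary labels.
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000) % 2

seed = 42  # fixed seed so the split is reproducible

# First carve off 30%, then cut that part in half: 70% / 15% / 15%.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=seed)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=seed)
```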

Note that we need to seed the split so that it is reproducible.

If there is a class imbalance issue, we should split the data in a stratified fashion (using the label array).
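
A minimal sketch of a stratified 70%/15%/15% split, assuming made-up labels with a 90/10 class ratio. Passing the label array as `stratify` keeps that ratio in every subset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90% class 0, 10% class 1.
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)

seed = 42  # fixed seed so the split is reproducible

# Stratify each split on the labels of the data being split.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=seed)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=seed)
```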

Model training

If we have come this far, training is actually the easier part. We just instantiate a classifier and train it.
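
A minimal sketch of that step, assuming synthetic data and a random forest as the classifier (the source does not say which model it uses, so this choice is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data with a held-out dev set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2, random_state=0)

# Instantiate a classifier, then fit it on the training set.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Mean accuracy on the held-out dev set.
dev_accuracy = clf.score(X_dev, y_dev)
```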

Evaluation

Having a single-number evaluation metric allows us to sort all models by their performance on this metric and quickly decide which one works best. In a production system, if we have multiple (N) evaluation metrics, we can set N-1 of them as 'satisficing' metrics, i.e., we only require them to meet a certain value, and then define the remaining one as the 'optimizing' metric that we directly optimize.
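
The satisficing/optimizing idea can be sketched in a few lines. The model names, scores, and the 100 ms latency limit below are all made up for illustration:

```python
# Hypothetical candidates: accuracy is the optimizing metric,
# latency (ms) is a satisficing metric with a hard upper limit.
candidates = [
    {"name": "model_a", "accuracy": 0.92, "latency_ms": 80},
    {"name": "model_b", "accuracy": 0.95, "latency_ms": 300},
    {"name": "model_c", "accuracy": 0.90, "latency_ms": 40},
]
MAX_LATENCY_MS = 100  # assumed threshold

# Keep only models that satisfy the constraint, then maximize the optimizing metric.
feasible = [m for m in candidates if m["latency_ms"] <= MAX_LATENCY_MS]
best = max(feasible, key=lambda m: m["accuracy"])

print(best["name"])  # model_a
```

Here `model_b` has the highest accuracy but is excluded because it violates the latency constraint.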