Clustering similar text documents in Python

This example shows how the scikit-learn API can be used to cluster documents by topic using a Bag of Words approach.

Two algorithms are demonstrated: KMeans and its more scalable variant, MiniBatchKMeans. Additionally, latent semantic analysis is used to reduce dimensionality and discover latent patterns in the data.

This example uses two different text vectorizers: a TfidfVectorizer and a HashingVectorizer. See the related example notebook for more information on vectorizers and a comparison of their processing times.

For document analysis via a supervised learning approach, see the related example script.

# Author: Peter Prettenhofer
#         Lars Buitinck
#         Olivier Grisel
#         Arturo Amor
# License: BSD 3 clause

Loading text data

We load data from the 20 newsgroups text dataset, which comprises around 18,000 newsgroup posts on 20 topics. For illustrative purposes and to reduce the computational cost, we select a subset of 4 topics that accounts for only around 3,400 documents. See the related example to gain an intuition about how these topics overlap.

Notice that, by default, the text samples contain some message metadata such as "headers", "footers" (signatures) and "quotes" of other posts. We use the remove parameter of fetch_20newsgroups to strip those features and obtain a more sensible clustering problem.

import numpy as np
from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]

dataset = fetch_20newsgroups(
    remove=("headers", "footers", "quotes"),
    subset="all",
    categories=categories,
    shuffle=True,
    random_state=42,
)

labels = dataset.target
unique_labels, category_sizes = np.unique(labels, return_counts=True)
true_k = unique_labels.shape[0]

print(f"{len(dataset.data)} documents - {true_k} categories")

3387 documents - 4 categories

Quantifying the quality of clustering results

In this section we define a function to score different clustering pipelines using several metrics.

Clustering algorithms are fundamentally unsupervised learning methods. However, since we happen to have class labels for this specific dataset, it is possible to use evaluation metrics that leverage this "supervised" ground truth information to quantify the quality of the resulting clusters. Examples of such metrics are the following:

homogeneity, which quantifies how much clusters contain only members of a single class;
completeness, which quantifies how much members of a given class are assigned to the same cluster;
V-measure, the harmonic mean of completeness and homogeneity;
Rand index, which measures how frequently pairs of data points are grouped consistently according to the result of the clustering algorithm and the ground truth class assignment;
Adjusted Rand index, a chance-adjusted Rand index such that a random cluster assignment has an ARI of 0.0 in expectation.
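The metrics listed above can be tried on a handful of hand-made labels before applying them to real documents. The following is a minimal sketch: the toy label lists and the helper name supervised_scores are illustrative assumptions, not part of the original example, while the scoring functions are the ones exposed by sklearn.metrics.

```python
from sklearn import metrics

# Illustrative toy labels (an assumption, not taken from the 20 newsgroups data).
labels_true = [0, 0, 0, 1, 1, 1]
# Same partition as the ground truth; only the cluster ids are swapped.
perfect = [1, 1, 1, 0, 0, 0]
# Every cluster is pure, but class 1 is split across clusters 1 and 2.
split = [0, 0, 0, 1, 1, 2]


def supervised_scores(y_true, y_pred):
    """Hypothetical helper: compute the label-based scores discussed above."""
    return {
        "Homogeneity": metrics.homogeneity_score(y_true, y_pred),
        "Completeness": metrics.completeness_score(y_true, y_pred),
        "V-measure": metrics.v_measure_score(y_true, y_pred),
        "Rand index": metrics.rand_score(y_true, y_pred),
        "Adjusted Rand index": metrics.adjusted_rand_score(y_true, y_pred),
    }


perfect_scores = supervised_scores(labels_true, perfect)
split_scores = supervised_scores(labels_true, split)

for name, scores in [("perfect", perfect_scores), ("split", split_scores)]:
    print(name, {key: round(float(value), 3) for key, value in scores.items()})
```

The permuted labeling scores 1.0 everywhere because these metrics ignore the actual cluster ids. The split labeling keeps a homogeneity of 1.0 (each cluster is pure) while its completeness drops below 1.0, which is the asymmetry the two definitions describe.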

If the ground truth labels are not known, evaluation can only be performed using the model results themselves. In that case, the Silhouette Coefficient comes in handy.
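As a rough illustration of a label-free score, the sketch below computes the Silhouette Coefficient on two synthetic, well-separated blobs. The blob positions, sizes and variable names are assumptions chosen for the illustration; only silhouette_score comes from scikit-learn.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two synthetic 2D blobs, far apart (assumed toy data).
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X_toy = np.vstack([blob_a, blob_b])

# A labeling that matches the blobs, and one that ignores the geometry.
good_labels = np.repeat([0, 1], 50)
random_labels = rng.integers(0, 2, size=100)

good_score = silhouette_score(X_toy, good_labels)
random_score = silhouette_score(X_toy, random_labels)

print(f"matching labels: {good_score:.3f}")
print(f"random labels:   {random_score:.3f}")
```

Only the data and the predicted labels are needed here, no ground truth. A labeling that follows the geometry scores close to 1, while an arbitrary one stays near 0.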

For more reference, see the clustering performance evaluation section of the scikit-learn user guide.

from collections import defaultdict
from sklearn import metrics
from time import time

evaluations = []
evaluations_std = []


def fit_and_evaluate(km, X, name=None, n_runs=5):
    name = km.__class__.__name__ if name is None else name

    train_times = []
    scores = defaultdict(list)
    for seed in range(n_runs):
        km.set_params(random_state=seed)
        t0 = time()
        km.fit(X)
        train_times.append(time() - t0)
        scores["Homogeneity"].append(metrics.homogeneity_score(labels, km.labels_))
        scores["Completeness"].append(metrics.completeness_score(labels, km.labels_))
        scores["V-measure"].append(metrics.v_measure_score(labels, km.labels_))
        scores["Adjusted Rand-Index"].append(
            metrics.adjusted_rand_score(labels, km.labels_)
        )
        scores["Silhouette Coefficient"].append(
            metrics.silhouette_score(X, km.labels_, sample_size=2000)
        )
    train_times = np.asarray(train_times)

    print(f"clustering done in {train_times.mean():.2f} ± {train_times.std():.2f} s ")
    evaluation = {
        "estimator": name,
        "train_time": train_times.mean(),
    }
    evaluation_std = {
        "estimator": name,
        "train_time": train_times.std(),
    }
    for score_name, score_values in scores.items():
        mean_score, std_score = np.mean(score_values), np.std(score_values)
        print(f"{score_name}: {mean_score:.3f} ± {std_score:.3f}")
        evaluation[score_name] = mean_score
        evaluation_std[score_name] = std_score
    evaluations.append(evaluation)
    evaluations_std.append(evaluation_std)

K-means clustering on text features

Two feature extraction methods are used in this example:

TfidfVectorizer uses an in-memory vocabulary (a Python dict) to map the most frequent words to feature indices and hence compute a (sparse) word occurrence frequency matrix. The word frequencies are then reweighted using the Inverse Document Frequency (IDF) vector collected feature-wise over the corpus.
HashingVectorizer hashes word occurrences to a fixed-dimensional space, possibly with collisions. The word count vectors are then normalized to each have an l2-norm equal to one (projected onto the Euclidean unit sphere), which seems to be important for k-means to work in high-dimensional space.
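The difference between the two vectorizers can be seen on a tiny corpus. This is a minimal sketch under stated assumptions: the three sentences and the choice of n_features=2**10 are made up for the illustration and are not the settings of the original example.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Assumed toy corpus, for illustration only.
toy_docs = [
    "the rocket reached orbit",
    "the shuttle reached the space station",
    "the graphics card renders the image",
]

# Stateful: learns a vocabulary (a dict) and IDF weights from the corpus.
tfidf = TfidfVectorizer()
X_toy_tfidf = tfidf.fit_transform(toy_docs)

# Stateless: hashes tokens into a fixed number of columns, nothing is learned.
hasher = HashingVectorizer(n_features=2**10)
X_toy_hashed = hasher.transform(toy_docs)

tfidf_norms = np.linalg.norm(X_toy_tfidf.toarray(), axis=1)
hashed_norms = np.linalg.norm(X_toy_hashed.toarray(), axis=1)

print("tf-idf shape:", X_toy_tfidf.shape, "vocabulary size:", len(tfidf.vocabulary_))
print("hashed shape:", X_toy_hashed.shape)
print("row norms:", tfidf_norms.round(3), hashed_norms.round(3))
```

The tf-idf matrix has one column per distinct term seen during fit, whereas the hashed matrix always has n_features columns regardless of the corpus. Both vectorizers apply l2 normalization by default, so each row has unit norm.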

Furthermore, it is possible to post-process those extracted features using dimensionality reduction. We will explore the impact of those choices on the clustering quality in the following sections.

Feature extraction using TfidfVectorizer

We first benchmark the estimators using a dictionary vectorizer along with the IDF normalization as provided by TfidfVectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_df=0.5,
    min_df=5,
    stop_words="english",
)
t0 = time()
X_tfidf = vectorizer.fit_transform(dataset.data)

print(f"vectorization done in {time() - t0:.3f} s")
print(f"n_samples: {X_tfidf.shape[0]}, n_features: {X_tfidf.shape[1]}")

vectorization done in 0.444 s
n_samples: 3387, n_features: 7929

After ignoring terms that appear in more than 50% of the documents (as set by max_df=0.5) and terms that are not present in at least 5 documents (as set by min_df=5), the resulting number of unique terms n_features is around 8,000. We can additionally quantify the sparsity of the X_tfidf matrix as the number of non-zero entries divided by the total number of elements.

print(f"{X_tfidf.nnz / np.prod(X_tfidf.shape):.3f}")

0.007

We find that around 0.7% of the entries of the X_tfidf matrix are non-zero.

Clustering sparse data with k-means

As both KMeans and MiniBatchKMeans optimize a non-convex objective function, their clustering is not guaranteed to be optimal for a given random init. Furthermore, on sparse high-dimensional data such as text vectorized using the Bag of Words approach, k-means can initialize centroids on extremely isolated data points. Those data points can remain their own centroids throughout the fit.

The following code illustrates how this phenomenon can sometimes lead to highly imbalanced clusters, depending on the random initialization.

from sklearn.cluster import KMeans

for seed in range(5):
    kmeans = KMeans(
        n_clusters=true_k,
        max_iter=100,
        n_init=1,
        random_state=seed,
    ).fit(X_tfidf)
    cluster_ids, cluster_sizes = np.unique(kmeans.labels_, return_counts=True)
    print(f"Number of elements assigned to each cluster: {cluster_sizes}")
print()
print(
    "True number of documents in each category according to the class labels: "
    f"{category_sizes}"
)

Number of elements assigned to each cluster: [   1    1 3384    1]
Number of elements assigned to each cluster: [1733  717  238  699]
Number of elements assigned to each cluster: [1115  256 1417  599]
Number of elements assigned to each cluster: [1695  649  446  597]
Number of elements assigned to each cluster: [ 254 2117  459  557]

True number of documents in each category according to the class labels: [799 973 987 628]

To avoid this problem, one possibility is to increase the number of runs with independent random initializations, n_init. In that case, the clustering with the best inertia (the objective function of k-means) is chosen.
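The selection rule can be sketched by hand on a small corpus: run k-means several times with a single initialization each, then keep the run with the lowest inertia. The toy documents, the number of clusters and the seeds below are assumptions made for the illustration. Passing n_init=5 asks KMeans to do the same kind of selection internally, with its own internally drawn seeds, so the two results need not coincide exactly.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed toy corpus with three loose topics, for illustration only.
toy_docs = [
    "rocket launch orbit",
    "satellite orbit launch",
    "rocket satellite mission",
    "graphics card image",
    "image rendering graphics",
    "rendering card pixels",
    "church faith belief",
    "faith belief religion",
    "religion church prayer",
]
X_toy = TfidfVectorizer().fit_transform(toy_docs)

# Manual version: one initialization per run, then keep the lowest inertia.
single_runs = [
    KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X_toy)
    for seed in range(5)
]
inertias = [km.inertia_ for km in single_runs]
best_single = min(single_runs, key=lambda km: km.inertia_)

# Built-in version: KMeans runs 5 initializations and keeps the best one.
multi_init = KMeans(n_clusters=3, n_init=5, random_state=0).fit(X_toy)

print("inertia per single run:", [round(float(i), 3) for i in inertias])
print("best single run:", round(float(best_single.inertia_), 3))
print("n_init=5 run:   ", round(float(multi_init.inertia_), 3))
```

The cost of this robustness is roughly proportional to n_init, since every initialization is a full k-means run.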

All of these clustering evaluation metrics have a maximum value of 1.0 (for a perfect clustering result). Higher values are better. Values of the Adjusted Rand index close to 0.0 correspond to a random labeling. Note from the scores reported by fit_and_evaluate that the cluster assignment is indeed well above chance level, but the overall quality can certainly improve.

Keep in mind that the class labels may not accurately reflect the document topics, so metrics that use labels are not necessarily the best way to evaluate the quality of our clustering pipeline.

Performing dimensionality reduction using LSA

Setting n_init=1 can still be used as long as the dimension of the vectorized space is reduced first to make k-means more stable. For that purpose we use TruncatedSVD, which works on term count/tf-idf matrices. Since SVD results are not normalized, we redo the normalization to improve the KMeans result. Using SVD to reduce the dimensionality of TF-IDF document vectors is often known as latent semantic analysis (LSA) in the information retrieval and text mining literature.
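The LSA step described above can be sketched as a two-stage pipeline: a truncated SVD followed by a re-normalization of each document vector. The toy corpus and n_components=2 are assumptions sized for the illustration; a real corpus would normally use far more components.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Assumed toy corpus, for illustration only.
toy_docs = [
    "new rocket launch reaches orbit",
    "new satellite reaches orbit after launch",
    "new space probe launch delayed",
    "new graphics card renders images",
    "new rendering software for graphics",
    "new images rendered by graphics card",
]
X_toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# SVD output is not unit-length, so each row is normalized again afterwards.
lsa = make_pipeline(
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
X_toy_lsa = lsa.fit_transform(X_toy_tfidf)

explained = lsa[0].explained_variance_ratio_.sum()
lsa_norms = np.linalg.norm(X_toy_lsa, axis=1)

print("tf-idf shape:", X_toy_tfidf.shape, "-> LSA shape:", X_toy_lsa.shape)
print(f"explained variance of the SVD step: {explained * 100:.1f}%")
```

The result is a small dense array, which is what makes the subsequent k-means runs cheaper than on the sparse tf-idf matrix.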

Using a single initialization means the processing time will be reduced for both KMeans and MiniBatchKMeans.

We can observe that clustering on the LSA representation of the documents is significantly faster (both because of n_init=1 and because the dimensionality of the LSA feature space is much smaller). Furthermore, all the clustering evaluation metrics have improved. We repeat the experiment with MiniBatchKMeans.

Top terms per cluster

Since TfidfVectorizer can be inverted, we can identify the cluster centers, which provide an intuition of the most influential words for each cluster. See the related example script for a comparison with the most predictive words for each target class.

HashingVectorizer

An alternative vectorization can be done using a HashingVectorizer instance, which does not provide IDF weighting as this is a stateless model (the fit method does nothing). When IDF weighting is needed, it can be added by pipelining the HashingVectorizer output to a TfidfTransformer instance. In this case we also add LSA to the pipeline to reduce the dimension and sparsity of the hashed vector space.
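A pipeline of this shape might look as follows. This is a sketch under stated assumptions: the toy corpus, n_features=2**12 and n_components=2 are illustrative values, not the settings used by the original example.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Assumed toy corpus, for illustration only.
toy_docs = [
    "new rocket launch reaches orbit",
    "new satellite reaches orbit after launch",
    "new space probe launch delayed",
    "new graphics card renders images",
    "new rendering software for graphics",
    "new images rendered by graphics card",
]

hashed_lsa = make_pipeline(
    HashingVectorizer(n_features=2**12),  # stateless hashing of tokens
    TfidfTransformer(),                   # adds IDF weighting on top
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),               # back to unit-length rows
)
X_toy_hashed_lsa = hashed_lsa.fit_transform(toy_docs)
hashed_lsa_norms = np.linalg.norm(X_toy_hashed_lsa, axis=1)

print("hashed + LSA shape:", X_toy_hashed_lsa.shape)
```

Because the hashing step keeps no vocabulary, the columns of the intermediate matrix cannot be mapped back to terms, which is the trade-off for not holding a dictionary in memory.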

One can observe that the LSA step takes a relatively long time to fit, especially with hashed vectors. The reason is that a hashed space is typically large (its size is set through the n_features parameter in this example). One can try lowering the number of features at the expense of having a larger fraction of features with hash collisions, as shown in the related example notebook.

We now fit and evaluate the KMeans and MiniBatchKMeans instances on this hashed-LSA-reduced data.

Both methods lead to good results that are similar to running the same models on the traditional LSA vectors (without hashing).

Clustering evaluation summary

KMeans and MiniBatchKMeans suffer from the phenomenon called the Curse of Dimensionality for high-dimensional datasets such as text data. That is the reason why the overall scores improve when using LSA. Using LSA-reduced data also improves stability and requires less clustering time, though keep in mind that the LSA step itself takes a long time, especially with hashed vectors.

The Silhouette Coefficient is defined between -1 and 1. In all cases we obtain values close to 0 (even if they improve a bit after using LSA) because its definition requires measuring distances, in contrast with other evaluation metrics such as the V-measure and the Adjusted Rand index, which are based only on cluster assignments rather than distances. Note that, strictly speaking, one should not compare the Silhouette Coefficient between spaces of different dimension, due to the different notions of distance they imply.

The homogeneity, completeness and hence V-measure metrics do not yield a baseline with regard to random labeling: this means that, depending on the number of samples, clusters and ground truth classes, a completely random labeling will not always yield the same values. In particular, random labeling won't yield zero scores, especially when the number of clusters is large. This problem can safely be ignored when the number of samples is more than a thousand and the number of clusters is less than 10, which is the case of the present example. For smaller sample sizes or a larger number of clusters, it is safer to use an adjusted index such as the Adjusted Rand index (ARI). See the related example for a demonstration of the effect of random labeling.
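The lack of a chance baseline can be checked numerically by scoring two independent random labelings against each other. In this sketch the sample size, the cluster counts and the helper name random_labeling_scores are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn import metrics

rng = np.random.default_rng(0)
n_samples = 100  # deliberately small, assumed for the illustration


def random_labeling_scores(n_clusters):
    """Hypothetical helper: score one random labeling against another."""
    y_true = rng.integers(0, n_clusters, size=n_samples)
    y_pred = rng.integers(0, n_clusters, size=n_samples)
    v_measure = metrics.v_measure_score(y_true, y_pred)
    ari = metrics.adjusted_rand_score(y_true, y_pred)
    return v_measure, ari


v_few, ari_few = random_labeling_scores(n_clusters=2)
v_many, ari_many = random_labeling_scores(n_clusters=50)

print(f" 2 clusters: V-measure={v_few:.3f}, ARI={ari_few:.3f}")
print(f"50 clusters: V-measure={v_many:.3f}, ARI={ari_many:.3f}")
```

With few samples and many clusters the V-measure of a purely random labeling is far from zero, while the ARI stays close to zero, which is why the adjusted index is the safer choice in that regime.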

The size of the error bars shows that MiniBatchKMeans is less stable than KMeans for this relatively small dataset. It is more interesting to use when the number of samples is much bigger, but it can come at the expense of a small degradation in clustering quality compared to the traditional k-means algorithm.