extract_archive

scikitplot.corpus.extract_archive(archive_path, output_path, *, supported_extensions=None, max_files=10000, max_total_bytes=2147483648)

Extract an archive to a destination directory.

Parameters:

archive_path (str or Path): Path to the archive file.
output_path (str or Path): Directory to extract files into. Created if it does not exist.
supported_extensions (frozenset[str] or None, optional): Whitelist of file extensions to include from the archive. If None, all files are included (subject to hidden-file and __pycache__ filtering). Default: None.
max_files (int, optional): Maximum number of files allowed in the archive. Archives exceeding this limit are rejected before extraction begins. Default: 10,000.
max_total_bytes (int, optional): Maximum cumulative extracted size in bytes. Extraction halts with a ValueError if this limit is exceeded (zip-bomb prevention). Default: 2,147,483,648 bytes (2 GiB).
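
The two limits can be pictured as a pre-flight pass over the archive's member list. The following is a minimal sketch using only the standard-library zipfile module, not scikitplot's actual implementation. The helper name check_zip_limits is hypothetical, directory entries are assumed not to count as files, and the check relies on the sizes declared in the zip headers, whereas the documented behaviour halts on cumulative extracted bytes.

```python
import io
import zipfile


def check_zip_limits(archive, max_files=10_000, max_total_bytes=2 * 1024**3):
    """Hypothetical helper: validate a zip against count and size limits.

    Returns the total declared uncompressed size in bytes, or raises
    ValueError when either limit is exceeded.
    """
    with zipfile.ZipFile(archive) as zf:
        # Assumption: directory entries are not counted as files.
        members = [info for info in zf.infolist() if not info.is_dir()]
        if len(members) > max_files:
            raise ValueError(
                f"archive has {len(members)} files, limit is {max_files}"
            )
        total = 0
        for info in members:
            # file_size is the uncompressed size declared in the header.
            total += info.file_size
            if total > max_total_bytes:
                raise ValueError(
                    f"declared size exceeds {max_total_bytes} bytes"
                )
        return total


# Build a small in-memory archive to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("b.txt", "world!")

total_bytes = check_zip_limits(buffer)
```

Declared sizes can be forged, so a header-based check like this one is only a first line of defence; counting bytes as they are written, as the max_total_bytes description implies, is the stricter guard.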

Returns:

list[Path]: Sorted list of extracted file paths (absolute).

Raises:

ValueError: If the file is not a recognised archive format.
ValueError: If the archive contains more than max_files members.
ValueError: If cumulative extracted bytes exceed max_total_bytes.
OSError: If the archive cannot be opened.

Notes

ZipSlip prevention: Every extracted member’s resolved path is verified to fall within output_path. Members with path-traversal components (../) are logged as warnings and skipped.
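
A containment check of this kind can be sketched with pathlib alone. This is an illustration under assumptions, not the library's code: the helper name is_within is hypothetical, and Path.is_relative_to requires Python 3.9 or later.

```python
from pathlib import Path


def is_within(output_dir, member_name):
    """Hypothetical helper: True if member_name resolves inside output_dir."""
    root = Path(output_dir).resolve()
    # Joining first and resolving afterwards collapses any "../" components.
    target = (root / member_name).resolve()
    return target.is_relative_to(root)


safe = is_within("/tmp/corpus_extract", "docs/report.pdf")
unsafe = is_within("/tmp/corpus_extract", "../../etc/passwd")
# An absolute member name replaces the root entirely when joined.
absolute = is_within("/tmp/corpus_extract", "/etc/passwd")
```

Comparing resolved paths rather than searching the member name for "../" also catches absolute member names, which discard the output directory when joined.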

Symlinks: Symbolic links inside archives are always skipped.
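
As an illustration of how link members can be told apart from regular files, here is a short sketch using the standard-library tarfile module. It assumes a tar archive purely for the example, since the page does not list which archive formats are recognised, and it is not scikitplot's implementation.

```python
import io
import tarfile

# Build an in-memory tar holding one regular file and one symbolic link.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tf:
    data = b"hello"
    regular = tarfile.TarInfo(name="notes.txt")
    regular.size = len(data)
    tf.addfile(regular, io.BytesIO(data))

    link = tarfile.TarInfo(name="shortcut")
    link.type = tarfile.SYMTYPE
    link.linkname = "/etc/passwd"
    tf.addfile(link)

# Re-open for reading and separate symbolic links from everything else.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tf:
    members = tf.getmembers()
    kept = [m.name for m in members if not m.issym()]
    skipped = [m.name for m in members if m.issym()]
```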

Examples

>>> from scikitplot.corpus import extract_archive
>>> files = extract_archive("corpus.zip", "/tmp/corpus_extract")
>>> [f.suffix for f in files]
['.pdf', '.txt', '.txt']