extract_archive

scikitplot.corpus.extract_archive(archive_path, output_path, *, supported_extensions=None, max_files=10000, max_total_bytes=2147483648)

Extract an archive to a destination directory.

Parameters:
archive_path : str or Path

Path to the archive file.

output_path : str or Path

Directory to extract files into. Created if it does not exist.

supported_extensions : frozenset[str] or None, optional

Whitelist of file extensions to include from the archive. If None, all files are included (subject to hidden-file and __pycache__ filtering). Default: None.

max_files : int, optional

Maximum number of files allowed in the archive. Archives exceeding this limit are rejected before extraction begins. Default: 10,000.

max_total_bytes : int, optional

Maximum cumulative extracted size in bytes. Extraction halts if this limit is exceeded (zip-bomb prevention). Default: 2 GiB (2,147,483,648 bytes).
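The two limits above can be approximated ahead of time with the standard library. The following is a minimal sketch for zip archives only, not the implementation used by extract_archive: the helper name check_zip_limits is hypothetical, and it relies on the uncompressed sizes declared in the zip headers, whereas the documented behaviour tracks cumulative bytes during extraction.

```python
import io
import zipfile


def check_zip_limits(archive, max_files=10_000, max_total_bytes=2_147_483_648):
    """Raise ValueError if a zip archive exceeds either limit (hypothetical helper)."""
    with zipfile.ZipFile(archive) as zf:
        # Count regular members only; directory entries carry no data.
        members = [m for m in zf.infolist() if not m.is_dir()]
        if len(members) > max_files:
            raise ValueError(f"archive has {len(members)} files, limit is {max_files}")
        # file_size is the uncompressed size declared in the zip header.
        total = sum(m.file_size for m in members)
        if total > max_total_bytes:
            raise ValueError(f"archive expands to {total} bytes, limit is {max_total_bytes}")
    return len(members), total


# Usage on a small in-memory archive holding three 10-byte files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in ("a.txt", "b.txt", "c.txt"):
        zf.writestr(name, b"x" * 10)

n_files, n_bytes = check_zip_limits(buf)
```

Declared sizes can be forged, so a check like this is only a cheap first filter; counting bytes as they are written remains necessary.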

Returns:
list[Path]

Sorted list of extracted file paths (absolute).

Raises:
ValueError

If the file is not a recognised archive format.

ValueError

If the archive contains more than max_files members.

ValueError

If cumulative extracted bytes exceed max_total_bytes.

OSError

If the archive cannot be opened.

Notes

ZipSlip prevention: Every extracted member’s resolved path is verified to fall within output_path. Members with path-traversal components (../) are logged as warnings and skipped.
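A containment check of this kind can be sketched with pathlib. This is an illustration under assumptions, not the library's code: the helper name is_within is hypothetical, and it tests only where the resolved path lands, so a name such as docs/../a.txt that stays inside the directory would pass here even though the function described above skips members containing ../.

```python
from pathlib import Path


def is_within(output_path, member_name):
    """Return True if member_name resolves to a location inside output_path."""
    root = Path(output_path).resolve()
    # Joining with an absolute member name discards root, so such names
    # resolve outside the directory and are rejected below.
    target = (root / member_name).resolve()
    return target == root or root in target.parents


safe = is_within("/tmp/corpus_extract", "docs/report.txt")
escaped = is_within("/tmp/corpus_extract", "../outside.txt")
nested_escape = is_within("/tmp/corpus_extract", "docs/../../outside.txt")
```

Resolving both sides before comparing matters: a plain string-prefix test would accept /tmp/corpus_extract_evil as being inside /tmp/corpus_extract.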

Symlinks: Symbolic links inside archives are always skipped.
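How symbolic links are detected is not stated here. One common approach with the standard library, shown as a sketch with hypothetical helper names, is to inspect the Unix mode bits stored in a zip member's external_attr and to call TarInfo.issym() for tar members.

```python
import stat
import tarfile
import zipfile


def zip_member_is_symlink(info):
    """Check the Unix file type stored in the high 16 bits of external_attr."""
    return stat.S_ISLNK(info.external_attr >> 16)


def tar_member_is_symlink(info):
    """Tar headers record the member type directly."""
    return info.issym()


# Build header objects by hand to show both outcomes.
link = zipfile.ZipInfo("link")
link.external_attr = (stat.S_IFLNK | 0o777) << 16
regular = zipfile.ZipInfo("a.txt")
regular.external_attr = (stat.S_IFREG | 0o644) << 16

tar_link = tarfile.TarInfo("link")
tar_link.type = tarfile.SYMTYPE
tar_regular = tarfile.TarInfo("a.txt")
```

Zip archives written on non-Unix systems may leave the mode bits at zero, in which case the zip check reports a regular file.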

Examples

>>> from scikitplot.corpus import extract_archive
>>> files = extract_archive("corpus.zip", "/tmp/corpus_extract")
>>> [f.suffix for f in files]
['.pdf', '.txt', '.txt']