extract_archive
- scikitplot.corpus.extract_archive(archive_path, output_path, *, supported_extensions=None, max_files=10000, max_total_bytes=2147483648)
Extract an archive to a destination directory.
- Parameters:
- archive_path : str or Path
Path to the archive file.
- output_path : str or Path
Directory to extract files into. Created if it does not exist.
- supported_extensions : frozenset[str] or None, optional
Whitelist of file extensions to include from the archive. If None, all files are included (subject to hidden-file and __pycache__ filtering). Default: None.
- max_files : int, optional
Maximum number of files allowed in the archive. Archives exceeding this limit are rejected before extraction begins. Default: 10,000.
- max_total_bytes : int, optional
Maximum cumulative extracted size in bytes. Extraction halts if this limit is exceeded (zip-bomb prevention). Default: 2 GiB (2,147,483,648 bytes).
- Returns:
- list[Path]
Sorted list of extracted file paths (absolute).
- Raises:
- ValueError
If the file is not a recognised archive format.
- ValueError
If the archive contains more than max_files members.
- ValueError
If cumulative extracted bytes exceed max_total_bytes.
- OSError
If the archive cannot be opened.
Notes
ZipSlip prevention: Every extracted member’s resolved path is verified to fall within output_path. Members with path-traversal components (../) are logged as warnings and skipped.
Symlinks: Symbolic links inside archives are always skipped.
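The two safety rules in these notes can be pictured with a small standard-library sketch. It is not the scikitplot implementation: `is_safe_member` is a hypothetical helper, it handles zip members only, and it assumes Python 3.9+ for `Path.is_relative_to`.

```python
import stat
import zipfile
from pathlib import Path


def is_safe_member(info: zipfile.ZipInfo, output_path: Path) -> bool:
    """Return True if a zip member may be extracted under output_path.

    Hypothetical helper for illustration; not part of scikitplot.
    """
    # For archives created on Unix, the file mode is stored in the
    # high 16 bits of external_attr.
    mode = info.external_attr >> 16
    if stat.S_ISLNK(mode):
        return False  # symbolic link: skip

    root = output_path.resolve()
    # Resolving collapses any "../" components in the member name.
    target = (root / info.filename).resolve()
    # Safe only if the resolved target still lies inside the root.
    return target.is_relative_to(root)
```

A caller would iterate over `ZipFile.infolist()`, log a warning for each member that fails the check, and extract only the rest.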
Examples
>>> from pathlib import Path
>>> from scikitplot.corpus import extract_archive
>>> files = extract_archive("corpus.zip", "/tmp/corpus_extract")
>>> [f.suffix for f in files]
['.pdf', '.txt', '.txt']
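The max_files and max_total_bytes limits can be approximated with a pre-flight check on a zip archive, as sketched below. `check_zip_limits` is a hypothetical helper, not part of scikitplot. It relies on the uncompressed sizes declared in the archive headers, whereas the documented function halts during extraction once the byte limit is exceeded, so treat this only as an illustration of the idea.

```python
import zipfile
from pathlib import Path


def check_zip_limits(archive_path, max_files=10_000, max_total_bytes=2_147_483_648):
    """Raise ValueError if a zip archive exceeds either limit.

    Hypothetical helper for illustration; not part of scikitplot.
    Returns (file_count, declared_bytes) when both limits hold.
    """
    with zipfile.ZipFile(Path(archive_path)) as zf:
        # Count regular entries only; directory entries carry no data.
        members = [m for m in zf.infolist() if not m.is_dir()]

    if len(members) > max_files:
        raise ValueError(f"{len(members)} members exceed max_files={max_files}")

    # file_size is the uncompressed size declared in the archive header.
    declared = sum(m.file_size for m in members)
    if declared > max_total_bytes:
        raise ValueError(
            f"{declared} bytes exceed max_total_bytes={max_total_bytes}"
        )
    return len(members), declared
```

Because declared sizes can be forged, a robust extractor would also count the bytes it actually writes and stop once the budget is spent.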