annoy.Index to NPY or CSV with examples

An example showing the Index class: exporting the vectors stored in an Annoy index to a pandas DataFrame, a CSV file, or an NPY file.

import random

random.seed(0)

# from annoy import Annoy, AnnoyIndex
from scikitplot.annoy import Annoy, AnnoyIndex, Index

print(AnnoyIndex.__doc__)

High-level Pythonic Annoy wrapper with picklable (or pickle-able).

Minimal modify spotify/annoy low-level C-API to extend Python API.

.. seealso::
    * :py:obj:`~scikitplot.annoy.Index.from_low_level`
    * https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled

import random
from pathlib import Path

random.seed(0)

HERE = Path.cwd().resolve()
OUT = HERE / "../../../scikitplot/annoy/tests" / "test_v2.tree"

f = 10
n = 1000
idx = AnnoyIndex(f, "angular")
for i in range(n):
    idx.add_item(i, [random.gauss(0, 1) for _ in range(f)])

idx.build(10)
idx.save(str(OUT))
print("Wrote", OUT)
idx

Wrote /home/circleci/repo/galleries/examples/annoy/../../../scikitplot/annoy/tests/test_v2.tree

Annoy(f=10, metric='angular', n_items=1000, n_trees=10, on_disk_path=/home/circleci/repo/galleries/examples/annoy/../../../scikitplot/annoy/tests/test_v2.tree)

Small subset → DataFrame/CSV

df = idx.to_dataframe(start=0, stop=1000)
df.to_csv("sample.csv", index=False)

import pandas as pd

pd.read_csv("sample.csv")

	id	feature_0	feature_1	feature_2	feature_3	feature_4	feature_5	feature_6	feature_7	feature_8	feature_9
0	0	0.941715	-1.396578	-0.679714	0.370504	-1.016349	-0.072120	0.179196	-0.831099	-1.309037	0.193888
1	1	0.993250	-0.646982	-0.333668	1.645672	-0.558890	-0.514157	2.404119	-1.531083	0.796466	-2.003649
2	2	-0.596963	1.503681	1.221436	-0.901120	-0.453699	0.080233	-1.258103	0.552220	2.227577	-1.355241
3	3	-1.981533	0.288244	-0.119123	1.804330	-0.160362	-0.050660	-0.190874	-0.990606	0.673030	-1.324083
4	4	1.166490	0.008376	0.503630	-0.552765	-0.920194	1.800263	0.468550	1.207003	0.187123	2.611608
...	...	...	...	...	...	...	...	...	...	...	...
995	995	-0.764022	0.174524	-0.816212	0.623093	-0.395465	0.193787	-0.769984	-0.147106	0.377592	-0.230512
996	996	0.812510	-1.125429	-0.725055	1.007468	-1.236581	-0.339250	0.958843	-0.857818	1.487129	0.667199
997	997	1.509753	0.877829	-0.604218	0.013888	-0.597203	1.374362	0.723732	-1.195797	0.084885	-0.644913
998	998	-0.479956	-0.314434	2.384329	-1.387915	1.522265	0.047036	0.547916	0.307560	-0.234338	-0.743033
999	999	1.797861	0.535607	0.371127	0.373999	1.999118	-1.771545	-0.133898	-0.841187	-0.977023	-0.905645

1000 rows × 11 columns

Streaming CSV (warning: the output file can be huge)

idx.to_csv("annoy_vectors.csv", start=0, stop=100_000)

'annoy_vectors.csv'

import pandas as pd

pd.read_csv("annoy_vectors.csv")

	id	feature_0	feature_1	feature_2	feature_3	feature_4	feature_5	feature_6	feature_7	feature_8	feature_9
0	0	0.941715	-1.396578	-0.679714	0.370504	-1.016349	-0.072120	0.179196	-0.831099	-1.309037	0.193888
1	1	0.993250	-0.646982	-0.333668	1.645672	-0.558890	-0.514157	2.404119	-1.531083	0.796466	-2.003649
2	2	-0.596963	1.503681	1.221436	-0.901120	-0.453699	0.080233	-1.258103	0.552220	2.227577	-1.355242
3	3	-1.981533	0.288244	-0.119123	1.804330	-0.160362	-0.050660	-0.190874	-0.990606	0.673030	-1.324082
4	4	1.166490	0.008376	0.503630	-0.552765	-0.920194	1.800263	0.468550	1.207003	0.187123	2.611608
...	...	...	...	...	...	...	...	...	...	...	...
995	995	-0.764022	0.174524	-0.816212	0.623093	-0.395465	0.193787	-0.769984	-0.147106	0.377592	-0.230512
996	996	0.812510	-1.125429	-0.725055	1.007468	-1.236580	-0.339250	0.958843	-0.857817	1.487129	0.667198
997	997	1.509753	0.877829	-0.604218	0.013888	-0.597203	1.374362	0.723732	-1.195797	0.084885	-0.644913
998	998	-0.479956	-0.314434	2.384329	-1.387915	1.522265	0.047036	0.547916	0.307560	-0.234338	-0.743033
999	999	1.797861	0.535607	0.371127	0.373999	1.999118	-1.771545	-0.133898	-0.841187	-0.977023	-0.905645

1000 rows × 11 columns
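The idea behind a streaming export can be sketched with the standard library alone. This is not how `to_csv` is implemented in scikitplot. The helper `stream_vectors_to_csv` and its `get_vector` callback are hypothetical names used only for illustration, and the in-memory list stands in for a real index. The point is that rows are written one at a time, so the writer never holds the whole table.

```python
import csv
import random
import tempfile
from pathlib import Path


def stream_vectors_to_csv(path, get_vector, f, start, stop):
    """Write rows [start, stop) one by one (hypothetical helper)."""
    header = ["id"] + [f"feature_{j}" for j in range(f)]
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        for i in range(start, stop):
            # Only the current vector is held by the writer loop.
            writer.writerow([i, *get_vector(i)])
    return str(path)


# Stand-in for an index: a plain list of vectors (assumption for the sketch).
random.seed(0)
f = 10
vectors = [[random.gauss(0, 1) for _ in range(f)] for _ in range(1000)]

out = Path(tempfile.mkdtemp()) / "vectors_sketch.csv"
stream_vectors_to_csv(out, vectors.__getitem__, f, start=0, stop=1000)

# Read the file back to check its layout: one header row plus 1000 data rows.
with open(out, newline="") as fh:
    rows = list(csv.reader(fh))
print(len(rows), rows[0][:3])
```

With a real index, `get_vector` would be replaced by whatever per-item accessor the index offers.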

Large export → memory-safe .npy

Exports items [0, n_items) into a memory-mapped .npy file.

idx.save_vectors_npy("annoy_vectors.npy")

'annoy_vectors.npy'

import numpy as np

np.load("annoy_vectors.npy")

array([[ 0.9417154 , -1.3965781 , -0.67971444, ..., -0.8310992 ,
        -1.3090373 ,  0.19388774],
       [ 0.9932497 , -0.64698166, -0.333668  , ..., -1.5310826 ,
         0.7964658 , -2.0036485 ],
       [-0.59696275,  1.5036808 ,  1.2214364 , ...,  0.55222   ,
         2.2275772 , -1.3552415 ],
       ...,
       [ 1.5097532 ,  0.8778289 , -0.6042179 , ..., -1.1957974 ,
         0.0848854 , -0.64491284],
       [-0.47995627, -0.31443435,  2.3843286 , ...,  0.30755976,
        -0.23433805, -0.7430332 ],
       [ 1.7978611 ,  0.53560704,  0.37112716, ..., -0.8411868 ,
        -0.9770226 , -0.90564495]], shape=(1000, 10), dtype=float32)
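For readers unfamiliar with memory-mapped .npy files, here is a minimal NumPy-only sketch of the general technique. It does not reproduce the internals of `save_vectors_npy`: the random data, the file name, and the chunk size are assumptions chosen for illustration. It relies on `numpy.lib.format.open_memmap`, which creates a valid .npy file that can be filled chunk by chunk.

```python
import tempfile
from pathlib import Path

import numpy as np
from numpy.lib.format import open_memmap

n_items, f = 1000, 10          # same sizes as the example index
chunk = 256                    # illustrative chunk size (assumption)
rng = np.random.default_rng(0)

out = Path(tempfile.mkdtemp()) / "vectors_sketch.npy"

# Create a .npy file with a proper header, backed by a memory map.
mm = open_memmap(str(out), mode="w+", dtype=np.float32, shape=(n_items, f))
for lo in range(0, n_items, chunk):
    hi = min(lo + chunk, n_items)
    # Random rows stand in for vectors read from an index.
    mm[lo:hi] = rng.standard_normal((hi - lo, f)).astype(np.float32)
mm.flush()
del mm  # release the map before reopening the file

# Reopen lazily; rows are paged in from disk only when accessed.
loaded = np.load(str(out), mmap_mode="r")
print(loaded.shape, loaded.dtype)
```

Loading with `mmap_mode="r"` keeps memory use low when the file is much larger than this toy example.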

Range-only export (strict, sized)

idx.save_vectors_npy("chunk_0_1m.npy", start=0, stop=1_000_000)

'chunk_0_1m.npy'

import numpy as np

np.load("chunk_0_1m.npy")

array([[ 0.9417154 , -1.3965781 , -0.67971444, ..., -0.8310992 ,
        -1.3090373 ,  0.19388774],
       [ 0.9932497 , -0.64698166, -0.333668  , ..., -1.5310826 ,
         0.7964658 , -2.0036485 ],
       [-0.59696275,  1.5036808 ,  1.2214364 , ...,  0.55222   ,
         2.2275772 , -1.3552415 ],
       ...,
       [ 1.5097532 ,  0.8778289 , -0.6042179 , ..., -1.1957974 ,
         0.0848854 , -0.64491284],
       [-0.47995627, -0.31443435,  2.3843286 , ...,  0.30755976,
        -0.23433805, -0.7430332 ],
       [ 1.7978611 ,  0.53560704,  0.37112716, ..., -0.8411868 ,
        -0.9770226 , -0.90564495]], shape=(1000, 10), dtype=float32)
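The call above requests `stop=1_000_000` on an index holding 1000 items, and the loaded array has 1000 rows, so the requested range appears to be clamped to the number of stored items. How the library handles this internally is not shown here. The following is only a sketch of one plausible clamping rule, and `clamp_range` is a hypothetical helper, not part of the scikitplot API.

```python
def clamp_range(start, stop, n_items):
    """Clamp a requested half-open range [start, stop) to [0, n_items).

    Hypothetical helper: rejects negative or reversed ranges, and trims
    any part of the range that lies beyond the stored items.
    """
    if start < 0 or stop < start:
        raise ValueError(f"invalid range: start={start}, stop={stop}")
    lo = min(start, n_items)
    hi = min(stop, n_items)
    return lo, hi


# Mirror the call in the example: 1000 stored items, stop far past the end.
lo, hi = clamp_range(0, 1_000_000, n_items=1000)
n_rows = hi - lo
print(lo, hi, n_rows)
```

Under this rule the output array size is known before any vector is read, which is what allows a memory-mapped file to be pre-allocated.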

Tags: level: beginner purpose: showcase

Total running time of the script: (0 minutes 0.068 seconds)
