DatasetsLoader
The core of our SDK is the integration of Python's Polars library, chosen for its efficiency in handling large datasets. Polars enables quick data processing and manipulation, which is vital for data analysis and machine learning. Our DatasetsLoader, built on Polars, offers an easy-to-use solution for loading various datasets, making the process smoother and more efficient for data-driven projects.
Locating reliable, easily reproducible datasets can often be a challenge. A key aim of the Giza Datasets SDK is to simplify the process of accessing datasets of various formats and types. The most straightforward way to start is to explore the Dataset Library or use the DatasetsHub.
Assuming we already know the name of the dataset we want to load, we can use the DatasetsLoader to load it.
By default, DatasetsLoader has the use_cache option enabled to improve dataset loading performance. To disable it, pass use_cache=False when initializing the class.
If you want to learn more about cache management, visit the Cache management section.
Depending on your device's configuration, you may need to provide SSL certificates so that HTTPS connections can be verified. Make sure these certificates are set up correctly before loading any data.
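As a quick way to check the certificate setup, the sketch below uses only Python's standard ssl module to report which CA bundle will be used for HTTPS verification. The helper name describe_ca_bundle is hypothetical and is not part of the Giza Datasets SDK.

```python
import ssl


def describe_ca_bundle() -> dict:
    """Report the CA locations Python's ssl module resolves by default."""
    paths = ssl.get_default_verify_paths()
    return {
        # Name of the environment variable that overrides the CA file
        # (usually SSL_CERT_FILE).
        "env_var": paths.openssl_cafile_env,
        # Resolved CA file; None when no file exists at the resolved location.
        "cafile": paths.cafile,
        # Resolved CA directory; None when the directory does not exist.
        "capath": paths.capath,
    }


if __name__ == "__main__":
    print(describe_ca_bundle())
```

If both cafile and capath come back as None, one common remedy is to set the reported environment variable to a valid bundle before creating the loader, for example the path returned by certifi.where() when the certifi package is installed.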
Once we have created our DatasetsLoader instance and our certificates are set up correctly, we are ready to load one of our datasets.
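The steps described so far can be sketched roughly as follows. The class name DatasetsLoader, the use_cache option, and the load method come from the text above. The import path giza_datasets, passing the dataset name as the first argument to load, and the helper name load_dataset are assumptions made for illustration.

```python
def load_dataset(name: str, use_cache: bool = True):
    """Create a DatasetsLoader and load one dataset by name.

    Assumptions: the import path `giza_datasets` and the call form
    `load(name)` are not confirmed by this page.
    """
    # Imported lazily so this helper can be defined without the SDK installed.
    from giza_datasets import DatasetsLoader  # assumed import path

    # use_cache is enabled by default according to the text above.
    loader = DatasetsLoader(use_cache=use_cache)
    # The text states that results are Polars DataFrames.
    return loader.load(name)


# Example call (hypothetical dataset name; requires the SDK and network access):
# df = load_dataset("example-dataset-name")
# print(df.head())
```

Replace the placeholder name with one taken from the Dataset Library or the DatasetsHub.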
The printed preview of the loaded DataFrame reports shape: (5, 7), that is, 5 rows and 7 columns.
Keep in mind that giza-datasets uses Polars (and not Pandas) as the underlying DataFrame library.
In addition, when use_cache = True (the default), the load method can load our data in eager mode, which offers advantages in both memory use and loading time.
For more detailed information on the advantages and use of this mode, visit our Eager mode section.
Success! We can now use the loaded dataset for ML development.