Open Source All About Data Processing, Dataverse

2024/03/26 | Written By: Eujeong Choi (Technical Writer), Chanjun Park (AI Research Engineer)
 

Dataverse is a freely-accessible open-source project designed to streamline the extract, transform, and load (ETL) pipeline using Python. In this post, we delve into the origins of this project and shed light on its future prospects in the realm of open-source data processing.

How It All Started

Data Processing as a Fundamental Part of LLM Ecosystem

Within the Large Language Model (LLM) sphere, the importance of robust data pre-processing cannot be overstated. Despite this significance, open-source pre-processing tools tailored to this domain remain scarce. Recognizing the pivotal role of data pre-processing in nurturing a vibrant open-source ecosystem, Upstage embarked on a mission to contribute to this critical aspect, aiming for win-win growth both for us and for those within the LLM ecosystem. By introducing Dataverse, Upstage aims to bridge this gap by sharing evolving data engineering techniques with the community and making them accessible in one place.

Promoting Fairness through Transparency

Another driving force behind this open-source initiative was the need for transparency in profit-sharing. Concerns have arisen over the varying costs of LLM APIs across languages, a discrepancy primarily attributed to the token-based pricing structure of APIs, as highlighted in the paper "Language Model Tokenizers Introduce Unfairness Between Languages" by Petrov et al. Even character-level and byte-level models exhibit over a fourfold difference in encoding length for some language pairs. This disparity results in unfair treatment for certain language communities in terms of access costs, processing time, latency, and the amount of context that can be provided to the models. Transparency is therefore crucial, especially in profit-sharing endeavors built on data pre-processing. Dataverse prioritizes transparency in its pre-processing methodologies to ensure a fair and equitable distribution of benefits across stakeholders regardless of their language, particularly within Upstage's ecosystem.

What is Dataverse?

Overview

At its core, Dataverse is a user-friendly, standardized solution for data processing and management, tailored to meet the demands of data scientists, analysts, and developers in the LLM era. Even for those unfamiliar with complex frameworks like Spark, Dataverse offers a straightforward approach to data pre-processing.

Architecture of Dataverse

Key Features

The standout characteristic of Dataverse is its flexibility. Users have the freedom to define custom functions, allowing for a more tailored and adaptable pre-processing experience.

  • Block-Based: In Dataverse, a block is a registered ETL function that runs on Spark. You build Spark pipelines like putting together puzzle pieces: blocks can easily be added, removed, or re-arranged through configuration to get the results you want (see the sketch after this list).

  • Configure-Based: All Spark settings and the sequence of blocks are defined in a configuration. You don't need to know all the underlying code; just set the options and you're good to go.

  • Extensible: It's designed to meet your specific demands, allowing for custom features that fit perfectly with your project.
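To make the block-and-configure idea concrete, here is a minimal sketch written in the style of the project's quickstart. The block names, config keys, and the `ETLPipeline` interface shown below are assumptions for illustration; consult the Dataverse documentation for the authoritative API.

```python
# Minimal sketch of a configure-driven Dataverse pipeline (illustrative only;
# block names and config keys are assumptions, see the Dataverse docs).
from omegaconf import OmegaConf
from dataverse.etl import ETLPipeline

config = OmegaConf.create({
    # Spark settings are plain options; no Spark code is written by hand.
    "spark": {
        "appname": "dataverse_quickstart",
        "driver": {"memory": "4g"},
    },
    # Each entry in `etl` is a block: a registered ETL function, run in order.
    "etl": [
        {"name": "data_ingestion___huggingface___hf2raw",
         "args": {"name_or_path": ["ai2_arc"]}},
        {"name": "utils___sampling___random",
         "args": {"sample_n_or_frac": 0.1}},
        {"name": "data_save___parquet___ufl2parquet",
         "args": {"save_path": "./sample/quickstart.parquet"}},
    ],
})

# Running the pipeline returns the Spark session and the processed dataset.
etl_pipeline = ETLPipeline()
spark, dataset = etl_pipeline.run(config=config, verbose=True)
```

Re-ordering, adding, or removing entries in the `etl` list is all it takes to change the pipeline; the Spark code underneath stays untouched.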

License

Dataverse operates under the Apache License 2.0.


For more detailed information, please visit our Dataverse documentation page:
Dataverse Documentation

Use Cases

Dataverse shines brightest when tackling large-scale text data pre-processing tasks. Moreover, it serves as a centralized hub for consolidating a myriad of pre-processing functionalities scattered across different libraries.

Within Upstage, we've extensively used Dataverse to pre-process training datasets for projects such as Solar Mini and the Up 1T Token Club. For instance, we employed Dataverse to deduplicate vast amounts of text data provided by our partner corporations, cleansing and enhancing it systematically. In one case, we downsized a patent dataset to 30% of its original size without compromising its quality or distribution. Ensuring that datasets meet the quality bar for training Large Language Models (LLMs) not only produced the groundwork data for training our models; the standardized processing code also delivered consistent results and became a foundation for collaboration within our team. Dataverse establishes a stable pre-processing foundation for your team, and its customizable nature allows for further enhancements.
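To give a flavor of that customizability, the sketch below registers a hypothetical custom block. The `register_etl` decorator, the block signature, and the assumption that `data` arrives as a Spark RDD of dict-like rows are drawn loosely from the project's examples rather than a guaranteed API, so verify the details against the documentation.

```python
# Hypothetical custom block; the decorator, the signature, and the assumption
# that `data` is a Spark RDD of dict-like rows with a "text" field are illustrative.
from dataverse.etl import register_etl

@register_etl
def cleaning___length___drop_short_documents(spark, data, min_chars=200, *args, **kwargs):
    """Drop documents whose text is shorter than `min_chars` characters."""
    return data.filter(lambda row: len(row.get("text", "")) >= min_chars)
```

Once registered, such a block can be referenced by name from the `etl` list in the configuration, just like the built-in blocks.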

Future Work and Contribution Points

Looking ahead, Dataverse harbors ambitious plans to expand its repertoire of preprocessing functions to encompass multimodal data types, including images and videos. Our vision extends to processing unstructured data regardless of its modality, ensuring uniform processing through Dataverse across various data types and formats.

Our vision for the evolution of Dataverse


Summary

In conclusion, Dataverse emerges as a promising contender in the realm of open-source data pre-processing, driven by a steadfast commitment to transparency, flexibility, and community collaboration.
We highly encourage contributions from the community in the form of custom pre-processing functions and Spark optimization to propel Dataverse towards its full potential.


Shape the future of data processing with us on the Dataverse ecosystem via GitHub!
