Why spend the effort to write about Apache Arrow
I came across Apache Arrow when I was looking into the Hugging Face library, where they make a big deal of it. I was intrigued, so I searched for more information about Apache Arrow and found that Wes McKinney was a big part of it. I've followed Wes since his Pandas days and everything he touches turns to gold, so I started a thread of questions to ask myself: what is Apache Arrow, what does it mean to the industry, and how might it help me in my job? I also looked at other projects that will benefit from Apache Arrow; Wes is part of some of those too. Time to dive deeper.
Some of the projects I've heard about that use Apache Arrow are:
- Pandas, a data frame library for manipulating and presenting data.
- Polars, similar to Pandas but written in Rust; it's much faster, though not as feature-rich as Pandas and with fewer integrations.
- Hugging Face Datasets and SafeTensors, libraries for AI and machine learning.
- Apache Spark, a big data processing engine used in many big data pipelines and as part of Databricks.
- R, a programming language for data analysis and visualisation.
- InfluxDB, which I've used extensively at work. It's a time series database used for sensor data and other time series data such as server metrics.
What is Apache Arrow
In summary, it's an in-memory columnar format: it's designed not for storage but for processing in memory, hence the name. The columnar layout enables acceleration on multicore CPUs and GPUs because every value in a column has the same datatype (rows usually mix datatypes), so processing is faster and there's less waiting for your queries. The in-memory part means the data is laid out ready to be processed straight away, whereas some formats are compressed or need converting first, which slows the data pipeline down.
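To make that concrete, here's a minimal sketch using pyarrow, the official Python bindings for Arrow (the sensor data is made up for illustration):

```python
import pyarrow as pa

# Each column is stored as one contiguous, typed buffer in memory,
# which is what makes vectorised processing on CPUs/GPUs so fast.
table = pa.table({
    "sensor_id": [1, 2, 3],
    "reading": [20.1, 19.8, 21.4],
})

print(table.schema)       # sensor_id: int64, reading: double
print(table["reading"])   # a single typed column, ready to process as-is
```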
Why are people (and companies) developing with Apache Arrow
Why reinvent something that already exists? If you're creating a new data manipulation library, why not use Apache Arrow as the data backend? It's widely used, lots of people are actively making it better, and it has the Apache Software Foundation behind it with all of its governance. There's a big community around it and it's well maintained; just create an issue on its GitHub page to get problems fixed, and it's all transparent.
The thing that interests me for my day job is moving data between one library (e.g. Polars) and another (e.g. Pandas), or one programming language (e.g. Python) and another (e.g. Rust), using something they call zero-copy, which I understand means the data stays in the same format/location and you just pass a pointer to the different libraries.
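Here's a small sketch of what that looks like with Polars, which is Arrow-native; whether a given conversion is truly zero-copy depends on the datatypes involved, but for plain numeric columns the Arrow buffers are handed over rather than rewritten:

```python
import polars as pl

pl_df = pl.DataFrame({"city": ["Oslo", "Cardiff"], "temp_c": [4.5, 9.1]})

# Polars data is already Arrow-backed, so this hands over the
# underlying Arrow buffers instead of copying the values
arrow_table = pl_df.to_arrow()

# ...and another Arrow-aware library can pick them up from there
pl_again = pl.from_arrow(arrow_table)
```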
I listened to a podcast by the creator of InfluxDB (listen here) and his take was that if you spend all that time developing your own data backend, by the time it's finished you've probably already lost the lead, because the competition has built on Apache Arrow and grabbed your customers. It's also hard running a big team of developers, not just technically but dealing with people and everything else that comes with it. So why not use Apache Arrow and spend the effort on the customer experience features that make your product unique? He also mentioned that the Apache Arrow team are doing a great job: if you can't beat them, join them!
Hugging Face is using it for its efficient handling of data, which is incredibly important in AI and machine learning pipelines. They make a lot of references to zero-copy data handling for things like saving the weights of (large) models.
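As a rough illustration of that idea, the safetensors library saves weights in a simple format that is memory-mapped on load, so individual tensors can be pulled out without copying the whole file into RAM first (the layer name and file name here are made up):

```python
import torch
from safetensors import safe_open
from safetensors.torch import save_file

weights = {"layer1.weight": torch.randn(4, 4)}
save_file(weights, "model.safetensors")

# The file is memory-mapped, so tensors can be read lazily
# rather than loading everything upfront
with safe_open("model.safetensors", framework="pt") as f:
    tensor = f.get_tensor("layer1.weight")
```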
How will this affect me and the tasks I do regularly
For work, I perform data manipulation in Pandas but it's slow, whereas something like Polars is quick but still young, so it doesn't have all the features and integrations Pandas has. I have some slow routines in Pandas that loop through GBs of data; I'm going to use Polars for the data-intensive part and then pass the result to Pandas for things like charting and other integrations (a sketch of this follows below). I'm mostly seeing positives from libraries adopting Apache Arrow rather than designing their own data backends. From an end-user perspective, it's great that I can use different tools throughout the pipeline without incompatibility issues, because the data is always in the same format. This means I can use the best tool for the job at any particular part of the data pipeline.
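Here's roughly how I plan to combine the two; the file and column names are hypothetical, but the pattern of doing the aggregation lazily in Polars and handing the much smaller result to Pandas should carry over:

```python
import polars as pl

# Heavy lifting in Polars: scan lazily instead of loading everything,
# then let its multi-threaded engine do the aggregation
result = (
    pl.scan_csv("readings.csv")
      .group_by("sensor_id")
      .agg(pl.col("reading").mean().alias("mean_reading"))
      .collect()
)

# Hand the small aggregated result to Pandas for charting/integrations
pdf = result.to_pandas()
pdf.plot(x="sensor_id", y="mean_reading", kind="bar")
```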
I've also heard good things about DuckDB, an in-process analytical database that integrates closely with Apache Arrow and can query Arrow data directly.
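As a quick sketch of that integration (with made-up data), DuckDB's Python API can scan a pyarrow table in place just by referencing the variable, and hand the result back as Arrow:

```python
import duckdb
import pyarrow as pa

readings = pa.table({"sensor_id": [1, 1, 2], "reading": [20.1, 19.8, 21.4]})

# DuckDB resolves the table name to the local Python variable and
# scans the Arrow buffers in place, with no import step needed
result = duckdb.sql(
    "SELECT sensor_id, avg(reading) AS mean_reading "
    "FROM readings GROUP BY sensor_id"
).arrow()
```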
I use PyTorch pretty much every day, and I've seen a project on their website, torcharrow, that's in beta. It's said to be similar to torch.tensor, but I've not used it yet; I will when I get time, or maybe when it's more mature.
It's going to be interesting to see where else Apache Arrow pops up; I keep hearing more about it as the days go by.