How Lyft Uses Big Data in its Operations

Lyft started late in the ride-hailing business compared to Uber, which virtually created the industry / Photo by: Piotr Swat via 123RF

 

Lyft started late in the ride-hailing business compared to Uber, which virtually created the industry, but being a late entrant enabled the company to design its cloud-based big data platform without encountering the problems Uber faced when it built its on-premises system. Alex Woodie, writing for Datanami, reported that the two companies did not hesitate to share information about their computing infrastructure.

An atmosphere of openness about how they use and develop technology pervades both firms, similar to other Silicon Valley technology firms, such as Google, Facebook, and Twitter, which are responsible for creating the big data ecosystem.

 

Hadoop Distributed File System

When Lyft was designing the computer system that would allow its mobile app to do everything it has to do, it decided to study how Uber had approached the problem. While Lyft stores its data in the cloud, Uber chose to invest in its own Apache Hadoop infrastructure.

Li Gao, a data engineer at Lyft, said storing data in the cloud allowed the company to bypass the technical difficulties Uber experienced in implementing the Hadoop Distributed File System. Lyft stores raw and normalized data on Amazon Web Services S3 and uses Amazon EC2 to process it.
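Cloud data lakes like the one described above typically keep raw events and normalized tables under date-partitioned object keys, so engines such as Hive or Presto can prune data by date. A minimal sketch of that layout, assuming hypothetical zone and table names (not Lyft's actual scheme):

```python
from datetime import date

def partition_key(zone: str, table: str, day: date, filename: str) -> str:
    """Build an S3-style object key like raw/ride_events/dt=2018-06-01/part-0000.json.

    Zone and table names here are illustrative assumptions, not Lyft's layout.
    """
    return f"{zone}/{table}/dt={day.isoformat()}/{filename}"

# Raw events land in one zone; cleaned, normalized tables in another.
raw_key = partition_key("raw", "ride_events", date(2018, 6, 1), "part-0000.json")
normalized_key = partition_key("normalized", "rides", date(2018, 6, 1), "part-0000.parquet")

print(raw_key)         # raw/ride_events/dt=2018-06-01/part-0000.json
print(normalized_key)  # normalized/rides/dt=2018-06-01/part-0000.parquet
```

The `dt=` partition convention lets a query engine scan only the days a query touches instead of the whole bucket.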

Its cloud expenditures are quite high, which is normal for a company that operates in 600 cities and earned US$2.1 billion in revenue in 2018. Lyft pays Amazon a minimum of US$8 million each month, and the company expects to spend up to US$300 million on storing its data in the cloud through 2021.

Lyft is not merely a storage customer of Amazon; it also uses the Kinesis message bus and the Redshift data warehouse. It has since migrated parts of its data processing and analytics infrastructure to other systems, though Amazon Web Services still hosts and bills them.

When Lyft was designing the computer system that would allow its mobile app to do everything it has to do, it decided to study how Uber had approached the problem / Photo by: Piotr Adamowicz via 123RF

 

Apache Hive

During its start-up days, Lyft relied heavily on Redshift, but in 2016 it shifted to Apache Hive after suffering scalability issues caused by Redshift's tight coupling of processing and storage. Gao said that Amazon Web Services has since resolved the processing-and-storage issue that hounded the company in 2016.

Presently, Lyft uses Apache Hive for big ETL (Extract, Transform, Load) jobs that supply data to company executives and business analysts. By 2018, Lyft was running a high-speed version of Hive that let it increase the volume and number of ETL jobs it could process. For a more powerful query engine, Lyft also rolled out Presto, the successor to Hive at Facebook; Presto's strength lies in its ability to query disparate data sources in ad hoc analysis.
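The shape of such an ETL job can be sketched in plain Python, using the stdlib sqlite3 module as a stand-in for the warehouse; the table and column names below are invented for illustration, not Lyft's schema:

```python
import sqlite3

# Stand-in warehouse: extract raw ride records, transform (filter + aggregate),
# and load a summary table that analysts would query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides_raw (city TEXT, fare REAL, status TEXT)")
conn.executemany(
    "INSERT INTO rides_raw VALUES (?, ?, ?)",
    [("SF", 12.5, "completed"), ("SF", 9.0, "canceled"), ("NYC", 20.0, "completed")],
)

# Transform + load: aggregate completed rides per city into a summary table.
conn.execute("""
    CREATE TABLE city_revenue AS
    SELECT city, COUNT(*) AS rides, SUM(fare) AS revenue
    FROM rides_raw
    WHERE status = 'completed'
    GROUP BY city
""")

rows = list(conn.execute("SELECT * FROM city_revenue ORDER BY city"))
print(rows)  # [('NYC', 1, 20.0), ('SF', 1, 12.5)]
```

In production the same SELECT-filter-aggregate pattern runs on Hive over partitioned files rather than an in-memory database, but the job structure is the same.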

Apache Spark

Lyft uses Apache Spark for multiple use cases, including batch ETL processing and training machine learning models. It also uses Apache Druid, a column-oriented, in-memory OLAP (Online Analytical Processing) data store, for drill-downs and roll-ups on large sets of high-dimensional data. The ride-hailing company also relies on Apache Superset, which incorporates a SQL editor with interactive querying.
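The roll-up and drill-down operations mentioned above can be illustrated in plain Python: a roll-up aggregates a metric over a coarse set of dimensions, while drilling down adds dimensions back in. The events and dimension names here are invented examples, not Druid's API:

```python
from collections import defaultdict

# Toy event set with two dimensions (city, hour) and one metric (rides).
events = [
    {"city": "SF", "hour": 8, "rides": 3},
    {"city": "SF", "hour": 9, "rides": 5},
    {"city": "NYC", "hour": 8, "rides": 7},
]

def rollup(rows, dims):
    """Aggregate the 'rides' metric over the given dimension columns."""
    out = defaultdict(int)
    for r in rows:
        out[tuple(r[d] for d in dims)] += r["rides"]
    return dict(out)

# Roll up to the city level, then drill down to (city, hour).
print(rollup(events, ["city"]))          # {('SF',): 8, ('NYC',): 7}
print(rollup(events, ["city", "hour"]))  # {('SF', 8): 3, ('SF', 9): 5, ('NYC', 8): 7}
```

Druid precomputes and stores such aggregates column by column, which is why these queries come back fast even on high-dimensional data.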

Gao said Lyft uses Hive and Spark for batch processing and Presto for interactive queries on interactive data sets. He added that the company uses Druid as a fast metrics data store and Superset as the user-interface tool for internal dashboards and metrics visualization.

Lyft uses Apache Spark for multiple use cases, including batch ETL processing and training machine learning models / Photo by: Dion Hinchcliffe via Flickr

 

Druid and Presto

The company has also built numerous data pipelines that send incoming data to data marts and processing engines. Lyft is currently building a connection between Druid and Presto that will let developers leverage the relative advantages of both query engines and surface insights through Superset. To serve other computing requirements, Lyft also maintains several relational databases, including Postgres and MySQL.

Data engineering and data science work runs through Apache Airflow, which is used to create repeatable data engineering and data science workflows that can be executed on Kubernetes. Airflow provides Lyft engineers and scientists with a broad abstraction layer for integrating different components in a reusable manner, including Lambda functions from Amazon Web Services. Gao said Airflow can orchestrate different units of work through a repeatable directed acyclic graph (DAG), and that it can also define SLAs and dependencies between DAG runs.
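The core idea behind DAG orchestration is that each task runs only after its upstream dependencies finish. This is not Airflow's API, but a minimal sketch of the scheduling logic with a hypothetical four-task pipeline:

```python
def topo_order(deps):
    """deps maps task -> set of upstream tasks; returns a valid run order."""
    order, done = [], set()

    def visit(task, seen=()):
        if task in done:
            return
        if task in seen:
            raise ValueError(f"cycle at {task}")  # a DAG must be acyclic
        for up in deps.get(task, ()):
            visit(up, seen + (task,))
        done.add(task)
        order.append(task)

    for t in deps:
        visit(t)
    return order

# Hypothetical pipeline: extract feeds transform, which feeds both a
# warehouse load and a model-training task.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load_warehouse": {"transform"},
    "train_model": {"transform"},
}
print(topo_order(deps))  # ['extract', 'transform', 'load_warehouse', 'train_model']
```

Airflow adds scheduling, retries, and SLA tracking on top of exactly this dependency structure.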

Lyft has also made significant investments in real-time data processing and uses Apache Kafka together with Apache Flink and Spark for its streaming services. The company first used Amazon Kinesis to build its real-time data infrastructure but migrated to Apache Kafka after encountering scalability issues with Kinesis. Gao said Kinesis performs inconsistently, while Kafka's performance is more reliable with a large number of clients.
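A typical job in such a Kafka-plus-Flink pipeline buckets a stream of timestamped events into fixed (tumbling) time windows and aggregates each window. A plain-Python sketch of that pattern, with invented event values:

```python
from collections import defaultdict

def window_counts(events, window_secs=60):
    """Count events per tumbling window, keyed by each window's start time."""
    counts = defaultdict(int)
    for ts, _value in events:
        counts[ts - ts % window_secs] += 1  # floor timestamp to window start
    return dict(counts)

# Timestamps in seconds: two events in the first minute, one in the second.
stream = [(0, "ride_requested"), (15, "ride_accepted"), (61, "ride_completed")]
print(window_counts(stream))  # {0: 2, 60: 1}
```

A real streaming engine applies the same windowing continuously and incrementally as events arrive from Kafka, rather than over a finished list.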

The company has a service contract with Confluent for enterprise Kafka service. Confluent helped popularize Kafka after the technology came out of LinkedIn, and it presently provides Kafka as a service on Amazon Web Services and other cloud platforms. Gao said Lyft uses Kafka extensively to move real-time metrics and event data.