Starburst today rolled out a host of enhancements to its Trino-based analytics platform for the cloud, called Galaxy, including support for Python, new caching and indexing features, and a new data catalog. The company unveiled the new features as it kicks off its two-day Datanova conference, which takes place online.
Starburst develops and sells two big data analytics offerings: Starburst Galaxy, the cloud-based service launched about 14 months ago, and Starburst Enterprise, a more established on-prem offering. Both are based on Trino, the Presto variant originally developed by Facebook as the faster successor to Apache Hive.
Many enterprises operating data lakes and lakehouses have already invested in data catalogs, but some have not. For those that have not, Starburst now offers data tracking capabilities built into Galaxy.
The new catalog enables Galaxy users to discover new files stored across any lake, including AWS S3, Azure Data Lake Storage (ADLS), and Google Cloud Storage (GCS), says Vishal Singh, head of data products at Starburst.
“When the files are found, those files get automatically indexed and cataloged, so there’s no extra work needs to be done to catalog those files,” Singh says.
The catalog generates metadata that helps data analysts become productive in their data lakes more quickly. In addition to indexing files and tracking their ownership, the new catalog automatically collects data such as top users and most popular tables, which can help inform decisions about usage and schema design.
“All those are automatically being generated behind the scenes,” Singh says. “We are the query engine, so all the information from queries to logging to auditability to privileges–everything is actually getting attached to the tables.”
The addition of Python support will help both analysts and data scientists, Starburst says. Instead of having to rewrite ETL scripts in SQL, Galaxy users will be able to import their existing ETL scripts, and the Starburst environment will automatically convert them to SQL under the covers, Singh says.
“It enables somebody to not rewrite the code, but actually switch the code from one end point to another endpoint, and use the flexibility of Starburst itself,” he says. “I’ve already written the code. All I’m doing is changing the end point.”
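To make the idea of converting Python to SQL under the covers more concrete, here is a minimal sketch of a DataFrame-style pipeline that records its operations and emits a SQL string on demand. The `LazyTable` class, its methods, and the `sales.orders` table are hypothetical names invented for illustration. This is not Starburst's API, and the private preview may work quite differently.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class LazyTable:
    """Hypothetical lazy table: records operations, emits SQL on demand."""
    name: str
    columns: Tuple[str, ...] = ("*",)
    predicates: Tuple[str, ...] = ()
    row_limit: Optional[int] = None

    def select(self, *cols: str) -> "LazyTable":
        # Replace the projected column list; nothing is executed yet.
        return replace(self, columns=tuple(cols))

    def filter(self, predicate: str) -> "LazyTable":
        # Accumulate predicates; they are AND-ed together in to_sql().
        return replace(self, predicates=self.predicates + (predicate,))

    def limit(self, n: int) -> "LazyTable":
        return replace(self, row_limit=n)

    def to_sql(self) -> str:
        # Plain string assembly for illustration only; it does no
        # escaping and is not safe for untrusted input.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        if self.row_limit is not None:
            sql += f" LIMIT {self.row_limit}"
        return sql


# The pipeline reads like ordinary Python but compiles to one SQL query.
orders = LazyTable("sales.orders")
query = orders.filter("amount > 100").select("customer_id", "amount").limit(10)
print(query.to_sql())
# SELECT customer_id, amount FROM sales.orders WHERE amount > 100 LIMIT 10
```

Because the object only builds a query description, the same script could in principle be pointed at a different engine by changing where the generated SQL is sent, which is the sense of "changing the end point" in Singh's description.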
Data scientists will also be able to use the new Python feature, which is in private preview, to power data exploration, he says.
Lastly, Galaxy gains a preview of Warp Speed, which is a series of new indexing and caching features that are already available in Starburst Enterprise. Warp Speed can accelerate queries in Galaxy by up to 7x, says Ali Huselid, senior vice president of product for Starburst.
“It’s an indexing piece and a caching piece,” she says. “So you’re indexing the right data yet you really making those decisions again based on what the pattern of user queries are.”
Warp Speed will be most applicable for things like dashboards, where there is some repeatability to the queries, Huselid says. “Dashboards are one example of something that tends to be very repeatedly executed, and so they’ll be optimized for that,” she says.
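The intuition that repeatedly executed queries are the best candidates for acceleration can be sketched in a few lines. This is a toy illustration, not Warp Speed's actual algorithm: the `normalize` and `hot_queries` helpers and the `min_runs` threshold are assumptions made for the example.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def normalize(sql: str) -> str:
    """Reduce a query to a pattern: lowercase it, collapse whitespace,
    and mask numeric literals, so dashboard queries that differ only
    in constants count as the same pattern. (Simplification: string
    literals are lowercased too.)"""
    text = re.sub(r"\s+", " ", sql.strip().lower())
    return re.sub(r"\b\d+(\.\d+)?\b", "?", text)


def hot_queries(log: Iterable[str], min_runs: int = 3) -> List[Tuple[str, int]]:
    """Return (pattern, run_count) pairs seen at least min_runs times,
    most frequent first. These would be the caching/indexing candidates."""
    counts = Counter(normalize(q) for q in log)
    return [(p, n) for p, n in counts.most_common() if n >= min_runs]


query_log = [
    "SELECT * FROM t WHERE id = 1",
    "select *  from t where id = 2",
    "SELECT * FROM t WHERE id = 37",
    "SELECT count(*) FROM u",
]
print(hot_queries(query_log))
# [('select * from t where id = ?', 3)]
```

A dashboard that refreshes the same panels all day would dominate such a count, while one-off exploratory queries would fall below the threshold.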
Warp Speed is based on work conducted by Varada, the Israeli Trino startup acquired by Starburst last year.
Today marked the first day of Datanova, Starburst’s free, virtual event. Starburst has more than 20 sessions over the two-day event, including presentations by Starburst CEO Justin Borgman, ThoughtSpot CSO Cindi Howson, Nextdata founder Zhamak Dehghani, journalist Kara Swisher, and others. You can still register to attend.