Jan 22, 2024
Highly Scalable System for DNA Analysis
Completed

$50,000+
4-6 months
United States
2-5
Service categories
Service Lines
Big Data
Software Development
Domain focus
Healthcare
Other
Programming language
Java
Perl
Frameworks
Hadoop
Subcategories
Big Data
Data Analytics
Data Migration
Data Visualization

Challenge

Because the existing system was already running on top-tier hardware, vertical scaling was no longer an option. The team also had to identify which parts of the legacy code could be adapted for parallel processing of DNA samples with Hadoop. Finally, the new system had to be migrated to production seamlessly.

Solution

At the client's U.S. offices, our Java engineers integrated Cloudera CDH 5.2 for distributed data storage and computation and used Cloudera Manager for cluster monitoring and profiling. The team built a mini-framework on top of MapReduce jobs, with custom partitioners to distribute data efficiently across the cluster. A specialized converter transformed binary variant files into the Hadoop SequenceFile format for storage in HDFS. We also devised a reference architecture for an enhanced reporting solution integrated with Apache Spark. Using Spark SQL, our experts preserved the existing SQL-based reporting module and made it easy to adapt to varied data sources.
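To illustrate the idea behind the custom partitioners mentioned above, here is a minimal sketch in plain Java. It is not the project's code: the key layout (`<group>:<recordId>`), the class name `GroupPartitionerSketch`, and the rule of partitioning by group prefix are all assumptions made for illustration. The method mirrors the shape of Hadoop's `Partitioner.getPartition(key, value, numPartitions)` but omits the value argument and uses no Hadoop classes, so it runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: routes records that share a group prefix to the same
// partition, so one reducer sees every record it needs to compare.
class GroupPartitionerSketch {

    // Assumed key layout: "<group>:<recordId>". Keys without ':' are
    // treated as a group on their own.
    static int getPartition(String key, int numPartitions) {
        int sep = key.indexOf(':');
        String group = sep >= 0 ? key.substring(0, sep) : key;
        // Mask the sign bit so the modulo result is never negative.
        return (group.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {"groupA:001", "groupA:002", "groupB:001", "groupC:001"};
        int numPartitions = 4;

        // Collect keys per partition to show where each record would land.
        Map<Integer, List<String>> buckets = new TreeMap<>();
        for (String key : keys) {
            int partition = getPartition(key, numPartitions);
            buckets.computeIfAbsent(partition, p -> new ArrayList<>()).add(key);
        }
        buckets.forEach((p, ks) -> System.out.println("partition " + p + " -> " + ks));
    }
}
```

In a real MapReduce job, logic of this kind would live in a subclass of Hadoop's `Partitioner`. Partitioning on a prefix rather than the whole key is one way to keep related records together while still spreading groups across reducers.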

Results

Altoros delivered a highly scalable analytical system for de-duplication of genome samples as part of the customer’s analytical platform. Thousands of hospitals and laboratories worldwide use the system to detect DNA mutations, saving thousands of lives. The analysis now takes minutes rather than hours, and the system processes 10x more genome samples than the legacy system did. Altoros’s engineers also proposed a reference architecture for updating the reporting solution. Inspired by our recommendations, the customer went on to improve the system with open-source data analytics technologies, which should eventually save thousands of dollars on expensive Oracle BI licenses.