Genentech
Data Engineer Intern
Jun 2024 - Sep 2024
South San Francisco, CA
Built large-scale data pipelines for data products and an internal AI chatbot powered by Retrieval-Augmented Generation (RAG)
Python · PySpark · SQL · Spark SQL · Snowflake · AWS Glue · AWS Lambda · AWS S3 · AWS Redshift · AWS Athena · Retrieval-Augmented Generation (RAG)
Data Pipeline Development & Integration
Snowflake → AWS S3 Data Movement
- Built AWS Glue jobs using PySpark and Spark SQL to extract data from Snowflake, transform it into required schemas, and load it into Amazon S3 (sketch after this list)
- Implemented CSV-to-Parquet conversions for more efficient querying and storage
- Automated ingestion processes, reducing manual steps and improving pipeline reliability
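A minimal sketch of this extract-and-load pattern, assuming the Snowflake Spark connector is attached to the Glue job; account, credential, table, and bucket names are hypothetical placeholders:

```python
# AWS Glue (PySpark) sketch: Snowflake -> S3. All connection values are
# hypothetical; in practice credentials come from AWS Secrets Manager.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

sf_options = {
    "sfURL": "example_account.snowflakecomputing.com",
    "sfDatabase": "BILLING_DB",
    "sfSchema": "RAW",
    "sfWarehouse": "ETL_WH",
    "sfUser": "etl_user",
    "sfPassword": "***",
}

# Extract: read a Snowflake table through the Spark Snowflake connector
df = (
    spark.read.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "BILLING_DOCUMENTS")
    .load()
)

# Load: write columnar Parquet to S3 for cheaper storage and faster scans
df.write.mode("overwrite").parquet("s3://example-bucket/billing_documents/")
```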
ETL Process Implementation
- Implemented end-to-end ETL: extracted raw SAP/Snowflake data, transformed it (column parsing, splitting, and formatting), and loaded it into target systems (transform sketch below)
- Worked on data ingestion flows for ValueTrak billing and other IRIS-related data products
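A sketch of the transform step on a hypothetical raw SAP extract with a combined `MATERIAL|PLANT` key and SAP-style `YYYYMMDD` date strings; paths and field names are assumed for illustration:

```python
# PySpark transform sketch: split a composite key, parse SAP date strings,
# and convert the CSV landing file to Parquet. Names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sap_transform").getOrCreate()

raw = spark.read.option("header", True).csv("s3://example-bucket/raw/sap_extract.csv")

clean = (
    raw
    # Column splitting: break the composite key into business-ready columns
    .withColumn("material", F.split("material_plant", r"\|").getItem(0))
    .withColumn("plant", F.split("material_plant", r"\|").getItem(1))
    # Formatting: parse SAP YYYYMMDD strings into proper dates
    .withColumn("billing_date", F.to_date("billing_date", "yyyyMMdd"))
    .drop("material_plant")
)

# CSV-to-Parquet conversion for more efficient querying and storage
clean.write.mode("overwrite").parquet("s3://example-bucket/curated/sap_extract/")
```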
Retrieval-Augmented Generation (RAG) System
AspireGPT Application Development
- Engineered a ChatGPT-style application that combined SAP cloud data with Azure OpenAI to help 100+ interns navigate updated SAP data flows and mapping requirements
- Designed and implemented the RAG architecture to retrieve relevant SAP documentation and system information, then generate contextual responses with Azure OpenAI (sketch after this list)
- Enhanced intern efficiency by providing instant access to complex SAP data systems, reducing time spent searching through documentation and improving understanding of new data flows
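A minimal sketch of the retrieve-then-generate loop, using the `openai` Python SDK's AzureOpenAI client; the endpoint, deployment names, and the tiny in-memory document store are hypothetical stand-ins for the real SAP corpus:

```python
# RAG sketch: embed documents, retrieve by cosine similarity, answer with
# the retrieved context. All names and endpoints are hypothetical.
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://example.openai.azure.com",
    api_key="***",
    api_version="2024-02-01",
)

docs = [
    "Billing documents flow from SAP S/4HANA into Snowflake nightly.",
    "ValueTrak mappings require billing type and sales org fields.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(docs)

def answer(question, k=1):
    q = embed([question])[0]
    # Cosine similarity between the question and every stored document
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    context = "\n".join(docs[i] for i in np.argsort(sims)[::-1][:k])
    chat = client.chat.completions.create(
        model="gpt-4o",  # hypothetical deployment name
        messages=[
            {"role": "system", "content": f"Answer using this SAP context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return chat.choices[0].message.content

print(answer("Where do billing documents land after SAP?"))
```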
SQL Development & Data Modeling
Snowflake SQL Optimization
- Wrote and optimized complex SQL statements to extract, join, and transform data from multiple Snowflake schemas
- Designed queries to map raw SAP data to ValueTrak Billing Document specifications, handling null values and field mismatches (example query below)
- Verified query outputs with Oracle SQL developers to ensure consistency between systems
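A sketch of this mapping style, written as Spark SQL; the `raw_sap` schema, tables, and columns are hypothetical and assumed to already exist in the session catalog:

```python
# Spark SQL sketch of a ValueTrak-style mapping query with null handling.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("valuetrak_mapping").getOrCreate()

billing = spark.sql("""
    SELECT
        COALESCE(b.billing_doc_id, 'UNKNOWN')  AS billing_document,  -- null handling
        TRIM(b.sales_org)                      AS sales_org,
        c.customer_name,
        CAST(b.net_value AS DECIMAL(18, 2))    AS net_value
    FROM raw_sap.billing_header b
    LEFT JOIN raw_sap.customer_master c
        ON b.customer_id = c.customer_id       -- keep unmatched billing rows
    WHERE b.billing_date >= DATE'2024-06-01'
""")
billing.show()
```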
Data Warehouse Architecture
- Applied fact & dimension table concepts for analytics, differentiating between transactional facts and descriptive dimensions
- Used SQL string and date functions (SPLIT_PART, DATE_FORMAT, LPAD, etc.) to create business-ready columns such as calendar months and weeks (example below)
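A self-contained Spark SQL example of those helpers (SPLIT_PART needs Spark 3.3+); the sample row is made up:

```python
# Spark SQL sketch: SPLIT_PART / DATE_FORMAT / LPAD building calendar
# columns from a fabricated sample row.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("calendar_columns").getOrCreate()
spark.sql(
    "SELECT DATE'2024-06-15' AS billing_date, 'US|WEST|CA' AS region_path"
).createOrReplaceTempView("billing")

spark.sql("""
    SELECT
        SPLIT_PART(region_path, '|', 1)       AS country,         -- 'US'
        DATE_FORMAT(billing_date, 'yyyy-MM')  AS calendar_month,  -- '2024-06'
        LPAD(CAST(WEEKOFYEAR(billing_date) AS STRING), 2, '0') AS calendar_week
    FROM billing
""").show()
```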
Machine Learning & System Integration
Dataiku ML Pipeline Training
- Built a predictive model to detect fraudulent job postings, starting with comprehensive data cleansing
- Applied Python scripts to handle missing values, generate derived columns, and prepare datasets for modeling (sketch after this list)
- Implemented feature engineering techniques including column splitting, text standardization, and data validation
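A compact sketch of that cleansing and feature-engineering flow on a made-up job-postings frame; the real pipeline ran inside Dataiku, and the columns and rows here are fabricated for shape only:

```python
# Fraud-posting model sketch: fill missing values, standardize text, derive
# features, then fit a simple classifier. Sample data is fabricated.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "title": ["Data Engineer", "WORK FROM HOME $$$", None, "Analyst"],
    "location": ["US, CA, SF", "US, NY, NYC", "US, TX, Austin", None],
    "fraudulent": [0, 1, 0, 0],
})

# Data cleansing: handle missing values and standardize text
df["title"] = df["title"].fillna("unknown").str.lower().str.strip()

# Feature engineering: split the location string, flag missing fields
df["state"] = df["location"].str.split(",").str[1].str.strip()
df["has_location"] = df["location"].notna().astype(int)
df["title_len"] = df["title"].str.len()

X = df[["has_location", "title_len"]]
y = df["fraudulent"]

model = RandomForestClassifier(random_state=0).fit(X, y)
print(model.predict(X))  # sanity check on the training frame
```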
Enterprise System Integration
- Collaborated with IT leads to compile metadata for partner systems and mapped interfaces for ASPIRE's SAP S/4HANA migration
- Attended Palantir Foundry trainings to understand enterprise data integration capabilities