
Genentech

Data Engineer Intern

Jun 2024 - Sep 2024
South San Francisco, CA

Built large-scale data pipelines for data products and an internal AI chatbot using Retrieval-Augmented Generation (RAG)

Python · PySpark · SQL · Spark SQL · Snowflake · AWS Glue · AWS Lambda · AWS S3 · AWS Redshift · AWS Athena · Retrieval-Augmented Generation (RAG)

Data Pipeline Development & Integration

Snowflake → AWS S3 Data Movement

  • Built AWS Glue jobs using PySpark and Spark SQL to extract data from Snowflake, transform it into the required schemas, and load it into Amazon S3 (a sketch follows this list)
  • Implemented CSV-to-Parquet conversions for more efficient querying and storage
  • Automated ingestion processes, reducing manual steps and improving pipeline reliability
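
A minimal sketch of that extract-and-load pattern, assuming the Spark-Snowflake connector JARs are attached to the Glue job; the connection options, table name, and S3 path are illustrative assumptions, not the actual job:

```python
# Sketch of a Snowflake -> S3 Glue job (all names and paths hypothetical).
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: pull a query result out of Snowflake via the Spark connector.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfDatabase": "SALES_DB",      # hypothetical database
    "sfSchema": "RAW",
    "sfWarehouse": "ETL_WH",
    "sfUser": "<user>",
    "sfPassword": "<password>",
}
df = (
    spark.read.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("query", "SELECT * FROM BILLING_DOCUMENTS")  # hypothetical table
    .load()
)

# Load: write Parquet instead of CSV for cheaper storage and faster scans.
df.write.mode("overwrite").parquet("s3://example-bucket/billing/parquet/")

job.commit()
```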

ETL Process Implementation

  • Applied end-to-end ETL concepts: extracting raw SAP/Snowflake data, transforming it (column parsing, splitting, and formatting), and loading it into target systems (see the transform sketch after this list)
  • Worked with data ingestion flows for ValueTrak billing and other IRIS-related data products
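
The transform step mostly amounts to DataFrame-level parsing and formatting. A minimal sketch, assuming hypothetical SAP-style column names:

```python
# Illustrative transform step: parse, split, and format raw columns.
# Column names (MATNR, ERDAT, AMOUNT_RAW) are hypothetical SAP-style fields.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-transform-sketch").getOrCreate()

raw = spark.createDataFrame(
    [("000000000012345", "20240615", "1,234.50")],
    ["MATNR", "ERDAT", "AMOUNT_RAW"],
)

clean = (
    raw
    # Strip SAP-style leading zeros from the material number.
    .withColumn("material_id", F.regexp_replace("MATNR", "^0+", ""))
    # Parse the yyyyMMdd date string into a proper date column.
    .withColumn("created_date", F.to_date("ERDAT", "yyyyMMdd"))
    # Remove thousands separators and cast the amount to a decimal type.
    .withColumn(
        "amount",
        F.regexp_replace("AMOUNT_RAW", ",", "").cast("decimal(18,2)"),
    )
)
clean.show()
```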

Retrieval-Augmented Generation (RAG) System

AspireGPT Application Development

  • Engineered a ChatGPT-like application that combined SAP cloud data with Azure OpenAI to help 100+ interns navigate updated SAP data flows and mapping requirements
  • Designed and implemented the RAG architecture: retrieve relevant SAP documentation and system information, then generate contextual responses with Azure OpenAI (a sketch of this retrieve-then-generate loop follows the list)
  • Improved intern efficiency by providing instant access to complex SAP data systems, cutting time spent searching documentation and deepening understanding of new data flows
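
A minimal sketch of the retrieve-then-generate loop using the openai Python SDK's AzureOpenAI client; the endpoint, deployment names, sample documents, and in-memory store are illustrative assumptions, not the production design:

```python
# Toy RAG loop: embed the question, rank docs by cosine similarity,
# and answer with the top matches as context. All names are hypothetical.
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://example.openai.azure.com",  # hypothetical
    api_key="<key>",
    api_version="2024-02-01",
)

docs = [
    "Billing documents flow from SAP S/4HANA into Snowflake nightly.",
    "ValueTrak mapping: SAP field VBELN maps to billing_document_id.",
]

def embed(texts):
    # Model name here is a hypothetical Azure deployment name.
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)

def answer(question, k=1):
    q = embed([question])[0]
    # Cosine similarity against every stored document vector.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(docs[i] for i in np.argsort(sims)[::-1][:k])
    chat = client.chat.completions.create(
        model="gpt-4o",  # hypothetical deployment name
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return chat.choices[0].message.content

print(answer("Which SAP field maps to billing_document_id?"))
```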

SQL Development & Data Modeling

Snowflake SQL Optimization

  • Wrote and optimized complex SQL statements to extract, join, and transform data from multiple Snowflake schemas
  • Designed queries to map raw SAP data to ValueTrak Billing Document specifications, handling null values and field mismatches (see the sketch after this list)
  • Verified query outputs with Oracle SQL developers to ensure consistency between systems
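
A minimal sketch of that kind of mapping query, run here through Spark SQL on in-memory sample data; the tables, SAP-style fields, and ValueTrak column names are assumptions:

```python
# Hypothetical mapping of raw SAP billing fields onto a ValueTrak-style
# spec; COALESCE/NULLIF guard against nulls and field mismatches.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("valuetrak-mapping-sketch").getOrCreate()

spark.createDataFrame(
    [("90001234", "", 1250.00), ("90001235", "CUST42", None)],
    ["VBELN", "KUNAG", "NETWR"],
).createOrReplaceTempView("billing_header")

spark.createDataFrame(
    [("90001234", "CUST99")],
    ["VBELN", "KUNNR"],
).createOrReplaceTempView("customer_master")

mapped = spark.sql("""
    SELECT
        COALESCE(b.VBELN, 'UNKNOWN')           AS billing_document_id,
        -- Fall back to the customer master when the header field is blank.
        COALESCE(NULLIF(b.KUNAG, ''), c.KUNNR) AS customer_id,
        COALESCE(b.NETWR, 0)                   AS net_value
    FROM billing_header b
    LEFT JOIN customer_master c ON b.VBELN = c.VBELN
""")
mapped.show()
```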

Data Warehouse Architecture

  • Applied fact & dimension table concepts for analytics, differentiating between transactional facts and descriptive dimensions
  • Used SQL string manipulation (SPLIT_PART, DATE_FORMAT, LPAD, etc.) to create business-ready columns such as calendar months/weeks (an example follows this list)
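
One plausible shape for those derived calendar columns, expressed as Spark SQL over a sample row; the table and column names are assumptions, and SPLIT_PART requires Spark 3.4+:

```python
# Deriving business-ready calendar columns with SQL string/date functions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("calendar-columns-sketch").getOrCreate()

spark.createDataFrame(
    [("2024-06-15", "US-WEST-01")],
    ["posting_date", "region_code"],
).createOrReplaceTempView("billing")

spark.sql("""
    SELECT
        DATE_FORMAT(TO_DATE(posting_date), 'yyyy-MM') AS calendar_month,
        LPAD(CAST(WEEKOFYEAR(TO_DATE(posting_date)) AS STRING), 2, '0')
                                                      AS calendar_week,
        SPLIT_PART(region_code, '-', 2)               AS region
    FROM billing
""").show()
```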

Machine Learning & System Integration

Dataiku ML Pipeline Training

  • Built a predictive model to detect fraudulent job postings, starting with comprehensive data cleansing
  • Applied Python scripts to handle missing values, generate derived columns, and prepare datasets for modeling
  • Implemented feature engineering techniques including column splitting, text standardization, and data validation (a cleansing sketch follows this list)
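
A minimal pandas-style sketch of that cleansing and feature-engineering pass, of the kind that runs inside a Dataiku Python recipe; the dataset and column names are hypothetical:

```python
# Hypothetical cleansing/feature-engineering pass for a job-postings dataset.
import pandas as pd

df = pd.DataFrame({
    "title": ["  Data Engineer ", "WORK FROM HOME!!!", None],
    "location": ["US, CA, San Francisco", "GB, , London", "US, NY, New York"],
    "salary_range": ["90000-120000", None, "70000-85000"],
})

# Handle missing values and standardize free text.
df["title"] = df["title"].fillna("unknown").str.strip().str.lower()

# Split compound columns into usable features.
df[["country", "state", "city"]] = df["location"].str.split(",", expand=True)
df["state"] = df["state"].str.strip().replace("", pd.NA)

# Derive numeric features from the raw salary range.
salary = df["salary_range"].str.split("-", expand=True).astype("float64")
df["salary_min"], df["salary_max"] = salary[0], salary[1]

# Simple validation flag, usable downstream as a model feature.
df["has_salary"] = df["salary_range"].notna()
print(df.head())
```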

Enterprise System Integration

  • Collaborated with IT leads to compile metadata for partner systems and mapped interfaces for ASPIRE's SAP S/4HANA migration
  • Attended Palantir Foundry training sessions to understand enterprise data integration capabilities
