I’m crunching numbers and designing experiments to fight disease and improve lives.

Data scientist with a Molecular Biology PhD. I have practical experience and a thorough understanding of big data analytics, from identifying the right question, to experimental design and data sampling approaches, to applying the appropriate statistical and machine learning methods for the task at hand. I am passionate about using data properly to avoid bias, and about the applications of data science in healthcare and public health.


Years of Experience


Prizes and Awards


Podcasts and invited talks


Peer-reviewed articles

My projects

  • CDS sepsis

    An AI App for value-based solutions in healthcare

    With the digitization of primary healthcare information almost complete, many hospitals are looking to leverage the investment and maximize the returns of this gargantuan effort that took place over the past two decades. Supporting clinicians in the diagnosis of sepsis through AI-backed clinical decision support tools makes value-based care a reality in the intensive care unit.

    Using MIMIC II data containing patient demographics and vital signs, I leverage machine learning interpretability to offer actionable insights to the ICU specialist. The app was built using the Wave App platform, and the design aims to avoid alert fatigue. Watch the free webinar where I explain how the app was built here (free registration).

  • Plan(t)Wise

    Planning a green canopy over the Big Apple

    City trees only survive for a small fraction of their natural lifespan in the wild. Because the City of New York spends a lot of effort and money every year to maintain its urban forest and improve the quality of life for New Yorkers, my app advises on the trees with the best survival record for any NYC address entered by the user.

    Plan(t)Wise was my personal project during the summer of 2016, soon after the most recent NYC Street Tree census dataset was released. Data preprocessing and analysis were done using Python and R. The web app is hosted on AWS and is powered by Flask. My goal was to create an algorithm that determines, based on historical data, which tree species and varieties have the best survival record in specific New York City locations. In the process, I developed an innovative algorithm for matching trees across NYC Tree Censuses, a task previously hindered by the low GPS resolution of these datasets.

    The project won the first data science prize in the inaugural NYC Open Data contest (2018). Why don’t you give the app a go with an NYC address of your choice at the plan-t-wise page?
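The matching algorithm itself is not described on this page. As a rough illustration of the general problem only, here is a minimal sketch of one simple approach: greedy nearest-neighbor matching of same-species trees between two censuses, followed by a per-species survival rate. The record fields (`id`, `species`, `lat`, `lon`), the 15 m distance threshold, and the toy coordinates are all assumptions made for this example; they are not taken from the NYC census data or from the Plan(t)Wise code.

```python
import math
from collections import defaultdict


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def match_trees(old, new, max_dist_m=15.0):
    """Greedy one-to-one matching of same-species trees within max_dist_m.

    old, new: lists of dicts with hypothetical keys id, species, lat, lon.
    Returns a dict mapping old tree id -> new tree id.
    """
    candidates = []
    for o in old:
        for n in new:
            if o["species"] != n["species"]:
                continue
            d = haversine_m(o["lat"], o["lon"], n["lat"], n["lon"])
            if d <= max_dist_m:
                candidates.append((d, o["id"], n["id"]))
    candidates.sort()  # closest pairs are assigned first
    matches, used_new = {}, set()
    for _, old_id, new_id in candidates:
        if old_id in matches or new_id in used_new:
            continue
        matches[old_id] = new_id
        used_new.add(new_id)
    return matches


def survival_by_species(old, matches):
    """Fraction of old-census trees per species that found a match."""
    totals, survived = defaultdict(int), defaultdict(int)
    for o in old:
        totals[o["species"]] += 1
        if o["id"] in matches:
            survived[o["species"]] += 1
    return {s: survived[s] / totals[s] for s in totals}


# Toy records, invented for illustration only.
old_census = [
    {"id": 1, "species": "oak", "lat": 40.70000, "lon": -74.0},
    {"id": 2, "species": "oak", "lat": 40.70100, "lon": -74.0},
    {"id": 3, "species": "maple", "lat": 40.70200, "lon": -74.0},
]
new_census = [
    {"id": "a", "species": "oak", "lat": 40.70005, "lon": -74.0},
    {"id": "b", "species": "maple", "lat": 40.70500, "lon": -74.0},
]

matches = match_trees(old_census, new_census)
rates = survival_by_species(old_census, matches)
print(matches, rates)
```

The distance threshold is the key design choice in a sketch like this: too small and noisy coordinates break true matches, too large and a replanted neighbor is mistaken for a survivor. The pairwise loop is also quadratic, so a city-scale dataset would need a spatial index.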

  • Complex PhenoGeno

    An exercise on multiple data integration for personalized diagnosis

    Unlike illnesses such as sickle cell anemia or cystic fibrosis, where a single gene is entirely responsible for the disease, diseases such as cancer, cardiovascular conditions, and obesity depend on the interaction of multiple defective (or disease-enabling) genes and the environment a person is exposed to. These "complex" diseases actually make up the majority of non-infectious disorders and are a topic of intense research. The reason complex diseases are so hard to predict and prevent via personalized medicine approaches is their dependence on a multitude of individual causes that act additively; different genetic mutations affect different biological pathways, and all together they contribute to the onset and severity of the disease. Moreover, certain environmental and behavioral factors (e.g. radiation exposure, smoking, reduced vitamin intake) might further disrupt these pathways, worsening the observed adverse outcomes even when the mutational burden for the disorder is not too high. Because multiple pathways are usually involved in complex disease, the symptoms are typically complex as well, comprising a collection of distinct phenotypes that are monitored and used in diagnosis.

    During a short fellowship at the National Institutes of Health (Bethesda, MD), I developed Complex PhenoGeno, a machine learning pipeline prototype that suggests a way to combine genetic information (patient-specific and literature-derived) with clinical tests (electronic health records data) and patient behavioral information to predict disease outcomes. What the pipeline now needs is patients and healthy individuals for whom we have both biological and extensive behavioral knowledge, in order to fine-tune the model. This is the hard part.
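To make the idea of "causes that act additively" concrete, here is a minimal sketch of an additive risk score in which genetic, clinical, and behavioral factors each add to the log-odds of disease. This is not the Complex PhenoGeno pipeline; the feature names, weights, and intercept below are hypothetical values chosen only for illustration.

```python
import math


def additive_risk(features, weights, intercept):
    """Logistic risk score: each factor adds its weighted value to the log-odds."""
    log_odds = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical weights for illustration only, not estimated from any data.
weights = {
    "risk_allele_count": 0.4,   # genetic: number of risk alleles carried
    "abnormal_lab_test": 0.8,   # clinical: 1 if a relevant test is abnormal
    "smoker": 0.6,              # behavioral: 1 if the person smokes
}
intercept = -3.0

low_burden = {"risk_allele_count": 1, "abnormal_lab_test": 0, "smoker": 0}
same_genes_plus_smoking = {"risk_allele_count": 1, "abnormal_lab_test": 0, "smoker": 1}

risk_low = additive_risk(low_burden, weights, intercept)
risk_smoker = additive_risk(same_genes_plus_smoking, weights, intercept)
print(round(risk_low, 3), round(risk_smoker, 3))
```

With the same genetic profile, adding a behavioral factor raises the predicted risk, which mirrors the point above that environment and behavior can worsen outcomes even when the mutational burden is modest.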

  • The state of obesity in the US

    Behavioral and clinical data approaches to predict obesity incidence

    Obesity has been the focus of intense research in recent years. Advances have been made in understanding its genetic and physiological basis, as well as in identifying social risk factors. Interestingly, the standard calorie intake equilibrium theory has not been able to completely explain poor weight loss outcomes. At the population level, ecological and epidemiological models are well established in describing the risk factors and spread of obesity. Among the ecological parameters that are important predictors of obesity at the population level are geography, socioeconomic status, and ethnic background. Since the Surgeon General's "Call to Action to Prevent and Decrease Overweight and Obesity" in 2001 and subsequent initiative announcements, more than 2,500 policies encouraging healthy lifestyles have been enacted at the state level. Obesity is nevertheless still on the rise. I created an interactive Shiny visualization describing the problem, which can be accessed here.

    My work at New York University had two goals: (1) use Electronic Health Records (EHR) and insurance claims data to predict (and prevent) childhood obesity; and (2) develop methodology for near-real-time screening of unhealthy behaviors at the population level, with the intention of enabling better targeting of meaningful interventions.