Radiant Earth Foundation is hosting an international expert workshop to discuss how best to apply machine learning (ML) techniques to NASA’s Earth Observation (EO) data to address environmental challenges. In particular, the workshop will cover the generation and use of training datasets for ML applications that use EO data. Participants will evaluate recent advancements, identify existing obstacles, and develop a best-practices guideline to increase the adoption of these techniques.

This workshop is sponsored by the NASA Earth Science Data Systems (ESDS) program.

Tuesday, January 21
 

8:00am

Registration and Meet & Greet
Join us bright and early to grab something to eat, a coffee or tea, get your name badge, and network with new colleagues and old friends!
  • If you are staying at the Cosmos Club, please note that your rate includes breakfast. We ask that you take advantage of that complimentary breakfast before heading to the Crentz Room (the main meeting room).

Tuesday January 21, 2020 8:00am - 9:00am
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

9:00am

Welcome & Introductions
Welcome remarks, participant introductions, and workshop logistics.

Speakers
Anne Miglarese

CEO, Radiant Earth Foundation
Anne Hale Miglarese is the Founder and CEO of Radiant.Earth, a non-profit organization working to aggregate the world’s open Earth imagery and providing access and education on its use to the global development community. Prior to launching Radiant.Earth, Ms Miglarese served as...
Kevin Murphy

Program Executive, ESDS Program, NASA
Kevin Murphy is the Program Executive for Earth Science Data Systems within NASA’s Earth Science Division.


Tuesday January 21, 2020 9:00am - 9:20am
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

9:20am

Morning Plenary - Presentations (1)
Workshop introductions, objectives and logistics. 

Speakers
Manil Maskey

Program Officer, NASA HQ
Hamed Alemohammad

Chief Data Scientist, Radiant Earth Foundation
Subit Chakrabarti

Data Scientist, Indigo Agriculture
I develop novel spatio-temporal machine learning methods applicable to large-scale Earth imagery. I work in the Geoinnovation team at Indigo Agriculture and am the chair of Young Professionals activities in the IEEE Geoscience and Remote Sensing Society.
Lyndon Estes

Assistant Professor, Clark University
Pierre Gentine

Associate Professor, Columbia University


Tuesday January 21, 2020 9:20am - 10:30am
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

10:30am

Break
Tuesday January 21, 2020 10:30am - 11:00am
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

11:00am

Lightning Presentations (1)
Speakers
Dalton Lunga

Lead Machine Learning Scientist, Oak Ridge National Laboratory
May Casterline

Senior Data Scientist, NVIDIA
Dr. May Casterline is a data scientist/image scientist/software developer with a background in satellite and airborne imaging systems. Her research interests include deep learning, hyperspectral and multispectral imaging, innovative applications of machine learning approaches to remote...
Ilkay Altintas

Chief Data Science Officer, UC San Diego, San Diego Supercomputer Center
Andi Gros

Data Scientist, Facebook
I am a scientist with a background in complex systems on Facebook’s core data science team. I currently work on machine learning problems to tackle large quantities of textual data (topic modeling). I also work on spatial demographic questions for internet.org. For my PhD I studied...
Steven Brumby

Senior Director Geographic Visualization, National Geographic Society
Zhuang-Fang Yi

Machine Learning Engineer, Development Seed
Gencer Sumbul

Research Associate & PhD Candidate, Technische Universität Berlin
Daniel Hogan

Data Scientist, IQT CosmiQ Works


Tuesday January 21, 2020 11:00am - 12:30pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

12:30pm

Lunch
Tuesday January 21, 2020 12:30pm - 1:30pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

1:30pm

Lightning Presentations (2)
Speakers
Ryan Abernathey

Associate Professor, Columbia University
Ryan Abernathey is a physical oceanographer and assistant professor of Earth and Environmental Sciences.
Joe Hamman

Scientist, NCAR
Anu Swatantran

Remote Sensing Data Science R&D Lead, Corteva Agriscience
Sherrie Wang

PhD Student, Stanford University
Machine learning + food security
Hannah Kerner

Assistant Research Professor, NASA Harvest/University of Maryland
Olha Danylo

Research Scholar, International Institute for Applied Systems Analysis


Tuesday January 21, 2020 1:30pm - 3:00pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

3:00pm

Break
Tuesday January 21, 2020 3:00pm - 3:30pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

3:30pm

Breakout Sessions Introductions
Speakers
Hamed Alemohammad

Chief Data Scientist, Radiant Earth Foundation


Tuesday January 21, 2020 3:30pm - 4:00pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

4:00pm

Breakout Session 1 (WG1): Training data generation and accounting for errors/uncertainties
Proposed Questions:
  1. How to define, measure, and document uncertainty in training data?
  2. How to treat sparsity in training data? If the data lacks spatial or temporal completeness, what are the ways to augment the data?
  3. How to decide on the size of training data required for a problem? This question cannot be answered in advance of building a model. But what steps should be followed to understand if the sample size is reasonable?
  4. How to understand and quantify representativeness and class balance/imbalance of training data? What are the metrics to assess geographical diversity and representativeness of training data?
  5. What are the requirements for compiling “benchmark training datasets” to advance model development in each science discipline? For example, if the community is building a pollution estimation model that will be integrated into a larger climate model and included in CMIP comparisons, how can we benchmark the training data for this model and progressively improve it over time?
  6. How do we deal with class imbalance issues in training data for Earth science machine learning classifications?
  7. From the Earth Science Data Systems (sponsor) perspective, how can we leverage the wealth of data for training models? For example, multiple data sources can sometimes be fused to create a labeled dataset without a manual process.
  8. What are the recommendations to map gaps in training data catalogs (based on science discipline or application area)?

Moderator(s)
Lyndon Estes

Assistant Professor, Clark University


Tuesday January 21, 2020 4:00pm - 5:00pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

4:00pm

Breakout Session 1 (WG2): Modeling approaches and best practices for building the best and computationally optimum model
Proposed Questions:
  1. Unlike common ML problems, Earth systems are probabilistic in nature, so we need to incorporate this property into our modeling frameworks and develop uncertainty-aware models. What are the latest advancements in this direction, and what are the recommendations for new research ideas in this domain?
  2. Multispectral data has more than 3 bands (unlike typical computer vision data) and has a temporal dimension which is very informative for many modeling efforts. What are the best practices in incorporating these data features into ML modeling?
  3. Projections, spatial grids, and temporal revisits are almost never the same across two datasets. How can we as a community enable better interoperability between these data (that sometimes need to be input to a model)? And what are the considerations in building models when doing re-projection or spatial/temporal interpolation?
  4. What are the considerations for model development when transitioning from R&D to production: 
    1. How frequently should we re-train models?
    2. Concept drift?
    3. Back-testing, Now-testing?
  5. How can ML models be used to improve specifics of measurement instruments? What are the possibilities for using existing training data and modeling frameworks to inform new data collection strategies? Could that lead to improved accounting of uncertainties in ML applications?

Moderator(s)
Dalton Lunga

Lead Machine Learning Scientist, Oak Ridge National Laboratory


Tuesday January 21, 2020 4:00pm - 5:00pm
Cosmos Club (Board Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

4:00pm

Breakout Session 1 (WG3): Best practices for sharing and publishing ML applications (model, training data, and results)
Proposed Questions:
  1. What are the requirements and best practices for documenting and disseminating training datasets to ensure reproducibility?
  2. What are the resources/repositories that researchers and practitioners can use to register and document training data? This should enable them to assign a DOI to the dataset.
  3. What are the recommended data formats for training data? For example, Earth system data are traditionally stored in NetCDF format, which is not cloud-friendly or ML-ready.
  4. How to share processed input data? In many cases, the input data are pre-processed and then used as features in developing an ML model. These data sometimes come from huge datasets like re-analysis products. What are the requirements for sharing the training data in these scenarios without duplicating storage of the input data while still ensuring reproducibility?
  5. What tools (if any) exist for discovering and accessing training data in cloud environments? If they are insufficient, what are the requirements for such a tool?

Moderator(s)
Hamed Alemohammad

Chief Data Scientist, Radiant Earth Foundation


Tuesday January 21, 2020 4:00pm - 5:00pm
Cosmos Club (Taft Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

5:00pm

Reception
Tuesday January 21, 2020 5:00pm - 8:00pm
Members Dining Room 2121 Massachusetts Ave NW, Washington, DC 20008, USA
 
Wednesday, January 22
 

8:00am

Meet & Greet
Join us bright and early to grab something to eat, a coffee or tea, get your name badge, and network with new colleagues and old friends!
  • If you are staying at the Cosmos Club, please note that your rate includes breakfast. We ask that you take advantage of that complimentary breakfast before heading to the Crentz Room (the main meeting room).

Wednesday January 22, 2020 8:00am - 9:00am
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

9:00am

Report of Day 1
Speakers
Hamed Alemohammad

Chief Data Scientist, Radiant Earth Foundation
Lyndon Estes

Assistant Professor, Clark University
Dalton Lunga

Lead Machine Learning Scientist, Oak Ridge National Laboratory


Wednesday January 22, 2020 9:00am - 9:30am
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

9:30am

Lightning Presentations (3)
Participant presentations.


Speakers
Patrick Grimont

Head of Copernicus GS services, ESA
Joe Flasher

Open Geospatial Data Lead, Amazon Web Services
Joe Flasher is the Open Geospatial Data Lead at Amazon Web Services helping organizations most effectively make data available for analysis in the cloud. The AWS open data program has democratized access to petabytes of data, including satellite imagery, genomic data, and data used...
Alex Leith

Assistant Director, Geoscience Australia
Peter Doucette

Associate Program Coordinator, U.S. Geological Survey
Benjamin Goldenberg

Engineering Manager, ML Infrastructure, Planet
Richard Choularton

Director, Agriculture and Economic Growth, Tetra Tech


Wednesday January 22, 2020 9:30am - 10:30am
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

10:30am

Break
Wednesday January 22, 2020 10:30am - 11:00am
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

11:00am

Lightning Presentations (4)
Participant presentations.

Speakers
Murali Gumma

Head – Remote Sensing / GIS Lab, ISD, ICRISAT
Murali Gumma leads the RS-GIS unit, and conducts research in spatial aspects of agricultural research providing spatial dimension to almost all the components of agriculture. Prior to this he was a Remote Sensing Specialist and Post-doctoral fellow at IRRI. Before that he was a project...
Tasso Azevedo

General Coordinator, MapBiomas
Christopher Brown

Engineer, Google
Brookie Guzder-Williams

Director of Data Science, World Resources Institute
Konrad Wessels

Assistant Professor, George Mason University
Remote Sensing expert.
Caleb Robinson

PhD Student, Georgia Institute of Technology
Lewis Fishgold

Software Engineer, Azavea
I am a software engineer at Azavea where I build tools for applying machine learning to geospatial imagery such as Raster Vision (https://github.com/azavea/raster-vision), and develop custom solutions using these tools for a variety of clients.


Wednesday January 22, 2020 11:00am - 12:30pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

12:30pm

Lunch
Wednesday January 22, 2020 12:30pm - 1:30pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

1:30pm

Breakout Session 2 (WG1): Training data generation and accounting for errors/uncertainties
Proposed Questions: same as for this working group's Breakout Session 1 on Tuesday, January 21.

Moderator(s)
Lyndon Estes

Assistant Professor, Clark University


Wednesday January 22, 2020 1:30pm - 3:00pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

1:30pm

Breakout Session 2 (WG2): Modeling approaches and best practices for building the best and computationally optimum model
Proposed Questions: same as for this working group's Breakout Session 1 on Tuesday, January 21.

Moderator(s)
Dalton Lunga

Lead Machine Learning Scientist, Oak Ridge National Laboratory


Wednesday January 22, 2020 1:30pm - 3:00pm
Cosmos Club (Board Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

1:30pm

Breakout Session 2 (WG3): Best practices for sharing and publishing ML applications (model, training data, and results)
Proposed Questions: same as for this working group's Breakout Session 1 on Tuesday, January 21.

Moderator(s)
Hamed Alemohammad

Chief Data Scientist, Radiant Earth Foundation


Wednesday January 22, 2020 1:30pm - 3:00pm
Cosmos Club (Taft Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

3:00pm

Break
Wednesday January 22, 2020 3:00pm - 3:30pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

3:30pm

Breakout Sessions Reporting
Speakers
Hamed Alemohammad

Chief Data Scientist, Radiant Earth Foundation
Lyndon Estes

Assistant Professor, Clark University
Dalton Lunga

Lead Machine Learning Scientist, Oak Ridge National Laboratory


Wednesday January 22, 2020 3:30pm - 4:00pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

4:00pm

Breakout Session 3 (WG1): Training data generation and accounting for errors/uncertainties
Proposed Questions: same as for this working group's Breakout Session 1 on Tuesday, January 21.

Moderator(s)
Lyndon Estes

Assistant Professor, Clark University


Wednesday January 22, 2020 4:00pm - 5:00pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

4:00pm

Breakout Session 3 (WG2): Modeling approaches and best practices for building the best and computationally optimum model
Proposed Questions: same as for this working group's Breakout Session 1 on Tuesday, January 21.

Moderator(s)
Dalton Lunga

Lead Machine Learning Scientist, Oak Ridge National Laboratory


Wednesday January 22, 2020 4:00pm - 5:00pm
Cosmos Club (Board Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

4:00pm

Breakout Session 3 (WG3): Best practices for sharing and publishing ML applications (model, training data, and results)
Proposed Questions: same as for this working group's Breakout Session 1 on Tuesday, January 21.

Moderator(s)
Hamed Alemohammad

Chief Data Scientist, Radiant Earth Foundation


Wednesday January 22, 2020 4:00pm - 5:00pm
Cosmos Club (Taft Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA
 
Thursday, January 23
 

8:00am

Meet & Greet
Join us bright and early to grab something to eat, a coffee or tea, get your name badge, and network with new colleagues and old friends!
  • If you are staying at the Cosmos Club, please note that your rate includes breakfast. We ask that you take advantage of that complimentary breakfast before heading to the Crentz Room (the main meeting room).

Thursday January 23, 2020 8:00am - 9:00am
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

9:00am

Report of Day 2
Thursday January 23, 2020 9:00am - 9:30am
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

9:30am

Breakout Session 4 (WG1): Training data generation and accounting for errors/uncertainties
Proposed Questions: same as for this working group's Breakout Session 1 on Tuesday, January 21.

Moderator(s)
Lyndon Estes

Assistant Professor, Clark University


Thursday January 23, 2020 9:30am - 11:00am
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

9:30am

Breakout Session 4 (WG2): Modeling approaches and best practices for building the best and computationally optimum model
Proposed Questions: same as for this working group's Breakout Session 1 on Tuesday, January 21.

Moderator(s)
Dalton Lunga

Lead Machine Learning Scientist, Oak Ridge National Laboratory


Thursday January 23, 2020 9:30am - 11:00am
Cosmos Club (Board Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

9:30am

Breakout Session 4 (WG3): Best practices for sharing and publishing ML applications (model, training data, and results)
Proposed Questions: same as for this working group's Breakout Session 1 on Tuesday, January 21.

Moderator(s)
Hamed Alemohammad

Chief Data Scientist, Radiant Earth Foundation


Thursday January 23, 2020 9:30am - 11:00am
Cosmos Club (Taft Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

11:00am

Break
Thursday January 23, 2020 11:00am - 11:30am
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

12:00pm

Wrap up & Adjourn
Thursday January 23, 2020 12:00pm - 12:30pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA

12:30pm

Lunch
Thursday January 23, 2020 12:30pm - 1:30pm
Cosmos Club (Crentz Room) 2121 Massachusetts Ave NW, Washington, DC 20008, USA