Accurate prediction of probabilistic and interactive human behavior is a prerequisite for full autonomy of mobile robots (e.g., autonomous vehicles) in complex scenes. To obtain accurate predictions, two fundamental needs must be addressed: 1) datasets of human behavior and motion in interactive tasks and scenarios, and 2) evaluation metrics and benchmarks for a broad range of prediction models and algorithms. Datasets are the most important asset because they are the source for both model training and validation. Evaluation metrics and benchmarks are equally fundamental because they provide not only criteria but also guidance for the design of prediction algorithms. The research community is still working toward high-quality datasets containing the interactive behavior of human-driven vehicles, pedestrians, cyclists, etc. There is also not yet a widely accepted evaluation metric that can comprehensively quantify the performance of different probabilistic prediction algorithms from the perspectives of both data approximation and fatality/utility impacts on the autonomy of the mobile robots. In this workshop, we intend to stimulate in-depth discussion on the construction and use of datasets and benchmarks for interactive human behavior, to facilitate research in the field of probabilistic prediction. We expect the workshop to yield a deeper understanding of the following questions:
• What kinds of raw data, labels and features do researchers working on probabilistic prediction expect the datasets to contain?
• What kinds of scenarios should be included in the datasets so that the behavior and motion of humans (or vehicles with human drivers) are highly interactive, versatile and representative?
• What levels of quantity, diversity, complexity and criticality are expected for the scenes and behavior contained in the datasets?
• What are the proper evaluation metrics for measuring the data approximation performance of prediction algorithms and for revealing the consequences for full autonomy (e.g., safety, efficiency) when those algorithms are adopted?
• How can the data efficiency, interpretability and generalizability of different prediction methods be compared?
• What is the typical performance of different categories of prediction algorithms, such as those based on deep neural networks or probabilistic graphical models and planning-based methods such as inverse reinforcement learning, when evaluated from different perspectives?
• What human behavior models should we adopt in prediction algorithms, and how can prior knowledge about human behavior be incorporated effectively into their design?