The CAMELS Multifield Data Set: Learning the Universe’s Fundamental Parameters with Artificial Intelligence

We present the Cosmology and Astrophysics with Machine Learning Simulations (CAMELS) Multifield Data set (CMD), a collection of hundreds of thousands of 2D maps and 3D grids containing many different properties of cosmic gas, dark matter, and stars from more than 2000 distinct simulated universes at several cosmic times. The 2D maps and 3D grids represent cosmic regions that span ∼100 million light-years and have been generated from thousands of state-of-the-art hydrodynamic and gravity-only N-body simulations from the CAMELS project. Designed to train machine-learning models, CMD is the largest data set of its kind containing more than 70 TB of data. In this paper we describe CMD in detail and outline a few of its applications. We focus our attention on one such task, parameter inference, formulating the problems we face as a challenge to the community. We release all data and provide further technical details at https://camels-multifield-dataset.readthedocs.io.


INTRODUCTION
In the era of precision cosmology, there is great interest in developing sophisticated techniques to optimally measure cosmological parameters from astrophysical datasets.To achieve the desired precision of nextgeneration experiments often requires accounting for complex astrophysical effects that can impact the properties of the luminous galaxies and gas from which cosmological parameters are inferred.
One of the main reasons behind this success is the fact that neural networks behave as universal function approximators (Hornik et al. 1990;Kurt & Hornik 1991;Cybenko 1989).Differently to many traditional methods, neural networks do not require the use of analytic expressions and are not limited to low-dimensional spaces.On the other hand, training neural networks usually requires very large datasets.While existing datasets for computer vision tasks are numerous and diverse (e.g.MNIST2 , CIFAR103 , and ImageNet4 ), the situation is very different for cosmology.
Large-volume cosmological hydrodynamic simulations have become a primary tool to study the formation of galaxies and large scale structure (Somerville & Davé 2015).These simulations follow explicitly the evolution of the dark matter and baryonic components in an expanding universe, incorporating a variety of subgrid prescriptions to model key physical processes such as star formation and feedback from massive stars and the impact of Active Galactic Nuclei (AGN) feedback powered by accretion onto supermassive black holes (e.g.Genel et al. 2014;Anglés-Alcázar et al. 2017a;Pillepich et al. 2018a;Davé et al. 2019).However, many of these processes remain poorly understood and require careful tuning of free parameters.While a single simulation can provide a plausible representation of the universe, model-dependent parameter degeneracies limit the predictive power of simulations and their use as primary tool to produce robust constraints on cosmological parameters.
The Cosmology and Astrophysics with MachinElearning Simulations (CAMELS) project (Villaescusa-Navarro et al. 2021) has pioneered a new approach, producing thousands of cosmological hydrodynamic simulations to train machine learning algorithms, varying cosmology, initial random field, subgrid galaxy formation model, and stellar and AGN feedback parameters.
In this paper we present and make publicly available the CAMELS Multifield Dataset, CMD, a large dataset containing hundreds of thousands of 2D maps and 3D grids with different properties obtained from 2,000 distinct universes.Each 2D map and 3D grid represents a region with an area of (25 h −1 Mpc) 2 and volume of (25 h −1 Mpc) 3 , respectively, where many different fields are included, from dark matter to gas and stellar properties, at different redshifts.Each 2D map and 3D grid is associated to a vector of either 2 or 6 numbers: two cosmological parameters (all data) and four astrophysical parameters (only data from hydrodynamic simulations) that define and control the behavior of the parent cosmological simulation in each case.The full dataset comprises more than 70 Terabytes and represents the largest collection of 2D maps and 3D grids from stateof-the-art hydrodynamic simulations publicly available to date, and may serve as a standard cosmological and astrophysical dataset to train machine learning models to perform a large variety of tasks.
In our companion papers (Villaescusa-Navarro & et al. 2021a,b) we used CMD to perform, for the first time, likelihood-free inference of cosmological parameters at the field level from 2D maps generated from state-of-theart hydrodynamic simulations of 13 different fields, obtaining very promising results.In this work we describe in detail the architecture and training procedure used in those works.Furthermore, we formulate the problems we encountered as a challenge to the community.We also release all codes and network weights from our companion papers (Villaescusa-Navarro & et al. 2021a,b) as a benchmark to compare against.
Information on how to download and manipulate CMD, together with the codes and network weights used for parameter inference and other tasks can be found at https://camels-multifield-dataset.readthedocs.io.
This paper is organized as follows.In section 2 we describe CMD in detail.We then outline in section 3 a few of its possible applications, focusing our attention on parameter inference.We conclude in section 4.

DATA
In this section we describe CMD.We first present CAMELS, the simulation suite used to generate CMD.We then outline the different fields present in CMD.Next, we explain how the 2D maps and 3D grids were created.Finally, we discuss the labels, symmetries, and storage needs of CMD.

The CAMEL Simulations
CMD was generated from CAMELS data (Villaescusa-Navarro et al. 2021).We now briefly describe those simulations here and refer the reader to Villaescusa-Navarro et al. (2021) for further details.
CAMELS is a suite of 4,233 numerical simulations: 2,184 state-of-the-art (magneto-)hydrodynamic simulations and 2,049 gravity-only N-body simulations.All simulations follow the evolution of 256 3 dark matter particles plus 256 3 fluid elements (only in the case of hydrodynamic simulations) from z = 127 to the present time, z = 0, in a periodic volume of (25 h −1 Mpc) 3 .The initial conditions of all simulations were generated at redshift z = 127 using second order Lagrangian perturbation theory (2LPT5 ).For simplicity, we assumed that the transfer functions of the dark matter and gas fluids of the (magneto-)hydrodynamic simulations were the same and equal to the one of total matter.For each simulation, we saved snapshots at 34 different redshifts, from z = 6 to z = 0. To generate CMD, we used the snapshots at z = [0, 0.5, 1, 1.5, 2].
While all simulations follow the evolution of dark matter particles, only the (magneto-)hydrodynamic simu-lations solve the hydrodynamic equations and implement astrophysical effects such as star formation and feedback from stars and AGN.Each CAMEL simulation belongs to one of three suites: 1) IllustrisTNG (magneto-hydrodynamic simulations), 2) SIMBA (hydrodynamic simulations), and 3) N-body (gravity-only simulations).CMD preserves this naming system.For instance, when addressing the IllustrisTNG neutral hydrogen maps, we refer to the 2D maps containing the neutral hydrogen field that were generated from the CAMELS IllustrisTNG simulation suite.We note that the IllustrisTNG and SIMBA simulations solve the hydrodynamic equations using two completely different methods and implement very different subgrid models.
The hydrodynamic simulations (IllustrisTNG and SIMBA) also vary four astrophysical parameters, coined A SN1 , A SN2 , A AGN1 , and A AGN2 that control the efficiency of supernova and AGN feedback.It is important to emphasize that although the names of these parameters are the same in IllustrisTNG and SIMBA, their implementation and meaning are not identical.This should be taken into account while working with the feedback parameter labels (we provide more details on this in section 2.5).
Each CAMEL simulation is thus characterized by the suite it belongs to (IllustrisTNG, SIMBA, or N-body) and either 2 or 6 numbers: two cosmological parameters (all simulations) and four astrophysical parameters (only IllustrisTNG and SIMBA).While the astrophysical parameters govern broadly similar quantities in Illus-trisTNG and SIMBA, they are implemented differently and hence their absolute values cannot be straightforwardly compared across models.Also, one cannot expect a one-to-one correspondence between the results of a simulation and the values of its parameters, as the simulations are also affected by cosmic variance: the finite volume that the simulations represent does not correspond to a representative sample of the whole universe.Thus, the link between quantities measured from a simulation and the values of its parameters can be considered as probabilistic.
The CAMELS simulations span a wide range in the values of the cosmological and astrophysical parameters.They were designed precisely in that way to avoid being affected by priors when training neural networks and also to allow overlap in parameter space when using different hydrodynamic simulations.On the one hand, we are mostly interested in performing inference on the value of the cosmological parameters, thus many different models need to be simulated.On the other hand, we know that astrophysical effects can have an impact on cosmological observables but we do not fully understand the physics of those effects, so the most conservative solution is to marginalize over them.For that reason, the values of the astrophysical parameters have to be varied over a wide range to perform a robust marginalization.
Each simulation suite contains four different sets: • LH (Latin-Hypercube) is a set of 1,000 simulations for each of the IllustrisTNG and SIMBA suites, where the values of the cosmological and astrophysical parameters are sampled from a latinhypercube.Each of these simulations has a different initial random seed.The vast majority of 2D maps and 3D grids from CMD were generated from this simulation set.This is the main simulation set from which we build CMD.
• 1P (1 parameter at a time) is a set of 61 simulations with the same initial random seed but where only the value of one parameter is varied at a time.CMD contains data from this set.
• CV (Cosmic Variance) is a set of 27 simulations with the same value of the cosmological and astrophysical parameters but different initial random seeds.A small fraction of 2D maps and 3D grids from CMD were generated from this set.The main purpose of the data from this set is to test the models trained on data from the LH set.
• EX (Extreme) is a set of 4 simulations covering extreme models.CMD does not contain data from this set.
We now describe in more detail the particulars of each simulation type.

IllustrisTNG
The IllustrisTNG simulations have been run with the AREPO code6 (Springel 2010) and employ the same subgrid physics as the original IllustrisTNG simulations (Weinberger et al. 2017;Pillepich et al. 2018b).In these simulations, A SN1 and A SN2 control two properties of galactic winds: the energy emitted per unit of starformation rate and the wind speed, respectively.A AGN1 and A AGN2 represent the energy released per unit of black hole accretion rate and the ejection speed (burstiness) for the kinetic mode of black hole feedback.We refer the reader to Villaescusa-Navarro et al. (2021) for further details on these simulations and their astrophysical parameters.

SIMBA
The SIMBA simulations have been run with the GIZMO code 7 (Hopkins 2015) and employ the same subgrid physics as the original SIMBA simulation (Davé et al. 2019).The parameters A SN1 and A SN2 control the mass loading factor and speed of galactic winds relative to scalings derived from the FIRE simulations (Muratov et al. 2015;Anglés-Alcázar et al. 2017b).The parameter A AGN1 determines the momentum flux of kinetic outflows in quasar and jet-mode AGN feedback relative to the black hole accretion rate (Anglés-Alcázar et al. 2017a) while A AGN2 parametrizes the speed of the jet-mode black hole feedback.We refer the reader to Villaescusa-Navarro et al. (2021) for further details on these simulations and their astrophysical parameters.
We note that although both the IllustrisTNG and SIMBA simulations aim at modeling the properties of cosmic gas, dark matter, and galaxies in a given cosmological model, the way they solve the hydrodynamic equations and implement their subgrid physics is substantially different.Thus, since neither simulation is a priori more accurate than another, this should also be seen as another factor to marginalize over when performing cosmological tasks such as parameter inference.For instance, for inference on the value of the cosmological parameters, it would be desirable that results do not depend on the simulation suite used for training a given model (see 3.1.4for more details).

N-body
The gravity-only simulations have been run with the Gadget-III code (Springel 2005).They only follow the evolution of dark matter particles which represent the cold dark matter plus baryon fluid, but as opposed to the above hydrodynamic simulations, they do not solve the hydrodynamic equations or model astrophysical effects such as supernova and AGN feedback.Thus, the data from these simulations is not affected by astrophysical effects, and therefore, CMD data created from these simulations can be seen as a pristine sample.These simulations only follow the evolution of the total matter field.
There is one N-body simulation for each hydrodynamical simulation in the IllustrisTNG and SIMBA suites.This is the reason why the number of 2D maps and 3D grids for the total matter field from these simula-7 http://www.tapir.caltech.edu/∼ phopkins/Site/GIZMO.html tions is twice as large as those from the IllustrisTNG and SIMBA simulations.In other words, for each total matter map or grid, there is an N-body counterpart.
The N-body simulations use the same random amplitudes and phases for the initial perturbations as their hydrodynamical analogs, such that they follow 'the same' universe, except for including only dark matter and not normal matter and correspondingly following only gravity and not other physics.

Fields
Before describing how we generate the CMD data from the CAMELS simulations, we first outline the different fields that comprise CMD.
The hydrodynamic simulations contain four different types of particles: 1) gas, 2) dark matter, 3) stars, and 4) black holes.The N-body simulations only have dark matter particles.All particle types have positions, velocities, and masses.The gas particles also contain a set of properties describing the physical state of the gas element they represent: pressure, temperature, metallicity, neutral hydrogen, electron number density, and magnetic fields (only in the case of IllustrisTNG simulations).
Each particle type represents a resolution element of its corresponding component, e.g. a gas particle represents a fluid element of cosmic gas.The actual 3D shape of these elements can be quite complicated8 but we approximate them as uniform spheres for simplicity, characterized by their window function and where R is the radius of the sphere and the integral runs over all 3D space.In the case of gas and dark matter particles, we set this radius to the distance to the 32nd closest9 gas and dark matter particles, respectively.For stars and black hole particles we set R = 0, as the resolution of our 2D maps and 3D grids is too coarse to be able to resolve the internal structure of galaxies, i.e. the window function is just a Dirac delta.
We now describe in more detail each CMD field that represents the spatial distribution of a different property of gas, dark matter, and stars in a given universe.In the following equations for each field, the window function we refer to is the one in Eq. 1 for 3D, and its projection when working in 2D (see 2.3 for further details).Table 1 summarizes the fields considered for each simulation suite.
• Gas density.This field represents the spatial distribution of the density of cosmic gas.To construct 2D maps and 3D grids for this field we need to read the positions and masses of all gas particles in a given simulation.For this field, the quantity stored in every pixel/voxel is the density of gas from all particles that contribute to that location, or more formally where W g,i , M g,i , and r g,i are the window function, mass, and position of the gas particle i.The sum runs over all gas particles and the integral is over the area/volume of a given pixel/voxel.Q represents the area of a pixel in the case of 2D maps or the volume of a voxel for the 3D grids.This field is only present in the IllustrisTNG and SIMBA simulations.
• Gas velocity.This field represents the spatial distribution of the modulus of the peculiar velocity vector of cosmic gas v g = | v g |.To generate 2D maps and 3D grids for this field we need to read the positions, masses, and velocities of all gas particles in a given simulation.The quantity stored in a pixel/voxel is the mass-weighted modulus of the gas velocity from all gas particles contributing to that location, or more formally where W g,i , M g,i , v g,i , and r g,i are the window function, mass, velocity, and position of the gas particle i.The sum runs over all gas particles and the integral is over the area/volume of the considered pixel/voxel.This field is only present in the IllustrisTNG and SIMBA simulations.Table 1.This table shows the number of 2D maps and 3D grids available for the different fields and simulations.All 2D maps are at z = 0 and contain 256 2 pixels.The 3D grids contain 128 3 , 256 3 , or 512 3 voxels and are at redshifts z = 0, z = 0.5, z = 1, z = 1.5 or z = 2.We also quote the prefix used for the different fields together with the units used to store the information in the 2D maps and 3D grids.The exponent A that appears in the units of several densities has a value of 2 in the case of 2D maps and of 3 for 3D grids.The symbol * denotes that for each 2D map and 3D grid of the total matter field of the IllustrisTNG and SIMBA simulations, there is a N-body counterpart.
We note that the above quantity should not be seen as the gas bulk velocity, which should be computed as Instead, in our definition each gas particle makes a positive contribution to the quantity, and its magnitude will be larger for higher velocities, independently of their direction.
• Gas temperature.This field represents the spatial distribution of the temperature of the cosmic gas.To generate 2D maps and 3D grids we need to read the positions, masses, and temperatures of all gas particles in a given simulation.The quantity stored in a pixel/voxel is the mass-weighted temperature of cosmic gas from all gas particles contributing to that location, or more formally where W g,i , M g,i , T i , and r g,i are the window function, mass, temperature, and position of the gas particle i.The sum runs over all gas particles and the integral is over the area/volume of the considered pixel/voxel.This field is only present in the IllustrisTNG and SIMBA simulations.
• Gas pressure.This field represents the spatial distribution of the pressure of the cosmic gas.To generate 2D maps and 3D grids we need to read the positions, masses, and pressures of all gas particles in a given simulation.The quantity stored in a pixel/voxel is the mass-weighted pressure of cosmic gas from all gas particles contributing to that location, or more formally where W g,i , M g,i , P i , and r g,i are the window function, mass, pressure, and position of the gas particle i.The sum runs over all gas particles and the integral is over the area/volume of the considered pixel/voxel.This field is only present in the IllustrisTNG and SIMBA simulations.
• Gas metallicity.This field represents the spatial distribution of the metallicity of the cosmic gas.The metallicity is defined as the ratio between the mass in metals 10 and the total gas mass: Z = M metal /M g .To generate 2D maps and 3D grids we need to read the positions, masses, and metallicities of all gas particles in a given simulation.The quantity stored in a pixel/voxel is the mean metallicity of cosmic gas from all gas parti-cles contributing to that location, or more formally where W g,i , M g,i , Z i , and r g,i are the window function, mass, metallicity, and position of the gas particle i.The sum runs over all gas particles and the integral is over the area/volume of the considered pixel/voxel.This field is only present in the Illus-trisTNG and SIMBA simulations.
• Neutral hydrogen density.This field represents the spatial distribution of the density of neutral hydrogen.To generate 2D maps and 3D grids we need to read the positions and neutral hydrogen masses of all gas particles in a given simulation.The quantity stored in a pixel/voxel is the total neutral hydrogen density from all gas particles contributing to that location, or more formally where W g,i , M HI,i , and r g,i are the window function, neutral hydrogen mass, and position of the gas particle i.The sum runs over all gas particles and the integral is over the area/volume of the considered pixel/voxel.Q represents the area of a pixel in the case of 2D maps or the volume of a voxel for the 3D grids.This field is only present in the IllustrisTNG and SIMBA simulations.
• Electron number density.This field represents the spatial distribution of the density of electrons.
To generate 2D maps and 3D grids we need to read the positions and electron abundances of all gas particles in a given simulation.The quantity stored in a pixel/voxel is the total electron number density from all gas particles contributing to that location, or more formally where W g,i , n e,i , and r g,i are the window function, number of electrons, and position of the gas particle i.The sum runs over all gas particles and the integral is over the area/volume of the considered pixel/voxel.Q represents the area of a pixel in the case of 2D maps or the volume of a voxel for the 3D grids.This field is only present in the IllustrisTNG and SIMBA simulations.
• Magnetic fields.This field represents the spatial distribution of the magnitude of the magnetic field vector of the cosmic gas: B = | B|.To generate 2D maps and 3D grids we need to read the positions, masses, and magnetic field modulus of all gas particles in a given simulation.The quantity stored in a pixel/voxel is the mass-weighted modulus of the magnetic field vector from all gas particles contributing to that location, or more formally where W g,i , B i , and r g,i are the window function, magnetic field modulus, and position of the gas particle i.The sum runs over all gas particles and the integral is over the area/volume of the considered pixel/voxel.This field is only present in the IllustrisTNG simulations.
• Magnesium over Iron ratio.This field represents the spatial distribution of the ratio between the magnesium and iron masses of the cosmic gas.
The mass of each gas particle represents the sum of the mass in hydrogen, helium, and metals of that particle.Among the metals, our simulations track the masses of the carbon, nitrogen, oxygen, magnesium, silicon, and iron elements.This field thus represents the ratio between the masses of those two elements that belong to each gas particle.To generate 2D maps and 3D grids we need to read the positions, magnesium masses, and iron masses of all gas particles in a given simulation.
The quantity stored in a pixel/voxel is the ratio between all magnesium and iron masses from all gas particles contributing to that location, or more formally where W g,i , M Mg,i , M Fe,i , and r g,i are the window function, magnesium mass, iron mass, and position of the gas particle i.The sum runs over all gas particles and the integral is over the area/volume of the considered pixel/voxel.This field is only present in the IllustrisTNG and SIMBA simulations.
• Dark matter density.This field represents the spatial distribution of the dark matter density.To generate 2D maps and 3D grids we need to read the positions and masses of all dark matter particles in a given simulation.The quantity stored in a pixel/voxel is the dark matter density from all dark matter particles contributing to that location, or more formally where W dm,i , M dm,i , and r dm,i are the window function, mass, and position of the dark matter particle i.The sum runs over all dark matter particles and the integral is over the area/volume of the considered pixel/voxel.Q represents the area of a pixel in the case of 2D maps or the volume of a voxel for the 3D grids.This field is only present in the IllustrisTNG and SIMBA simulations11 .
• Dark matter velocity.This field represents the spatial distribution of the modulus of the peculiar velocity vector of the dark matter v dm = | v dm |.To generate 2D maps and 3D grids for this field we need to read the positions, masses, and velocities of all dark matter particles in a given simulation.
The quantity stored in a pixel/voxel is the massweighted modulus of the dark matter velocity from all dark matter particles contributing to that location, or more formally where W dm,i , M dm,i , v dm,i , and r dm,i are the window function, mass, velocity, and position of the dark matter particle i.The sum runs over all dark matter particles and the integral is over the area/volume of the considered pixel/voxel.This field is only present in the IllustrisTNG and SIMBA simulations.
We note that the above quantity should not be seen as the dark matter bulk velocity, which should be computed as Instead, in our definition each gas particle has a positive contribution and its magnitude will be larger for higher velocities, independently of their direction.
• Stellar mass density.This field represents the spatial distribution of the stellar mass density.
To generate 2D maps and 3D grids we need to read the positions and masses of all star particles in a given simulation.The quantity stored in a pixel/voxel is the stellar mass density from all star particles contributing to that location, or more formally where W * ,i , M * ,i , and r * ,i are the window function, mass, and position of the star particle i.
The sum runs over all star particles and the integral is over the area/volume of the considered pixel/voxel.Q represents the area of a pixel in the case of 2D maps or the volume of a voxel for the 3D grids.This field is only present in the Il-lustrisTNG and SIMBA simulations.
• Total matter mass.This field represents the spatial distribution of the total matter mass.Total matter is defined as the sum of gas, dark matter, stars, and black holes and thus represents the total amount of mass, baryonic as well as dark, in a given universe.To generate 2D maps and 3D grids we need to read the positions and masses of all gas, dark matter, stars, and black holes particles in a given simulation.The quantity stored in a pixel/voxel is the total matter density from all different particles contributing to that location, or more formally where W g,i , W dm,i , W * ,i , and W bh,i are the window function of i gas, dark matter, star, and black hole particle, respectively.M g,i , M dm,i , M * ,i , M bh,i represent the mass of the i gas, dark matter, star, and black hole particle, correspondingly.r g,i , r dm,i , r * ,i , r bh,i are the position of the i gas, dark matter, star, and black hole particle.The sums run over all particles of the different types and the integral is over the area/volume of the considered pixel/voxel.Q represents the area of a pixel in the case of 2D maps or the volume of a voxel for the 3D grids.
We note that in the case of the N-body simulations, we only need to read the positions and masses of the dark matter particles and evaluate the second term of the above expression.This field is present in all three simulation types: Il-lustrisTNG, SIMBA, and N-body.We emphasize that the number of 2D maps and 3D grids from the N-body simulations is equal to the sum of those from the IllustrisTNG and SIMBA simulations, since for each map and grid from these simulations there is an N-body counterpart.

2D Maps
We now describe the method we use to generate the 2D maps and their characteristics.In Table 1 we summarize the number and properties of the CMD 2D maps.
The 2D maps are generated as follows.First, we consider a given simulation and read the positions and properties of the considered field (see the above subsection for further details on each field).Next, we compute the radius of the considered particles as the distance to the 32nd closest gas and dark matter particle (in the case of gas and dark matter particles, respectively), or set it to zero for star and black hole particles.We then consider a slice of dimensions 25 × 25 × 5 (h −1 Mpc) 3 and select the particles that lie inside it.For each simulation we take 15 slices: 5 in the XY plane, 5 in the XZ plane, and 5 in the YZ plane.We note that slices with the same projection direction do not overlap in space.We then project the window function of the considered particles into 2D: where we have made use of cylindrical coordinates, and z represents the axes along which we project the slice.Note that by construction, the quantities will be preserved in the projection, i.e. x Finally, we deposit the properties associated to these circles into a regular 2D grid with 256 × 256 pixels by performing the integrals of section 2.2.We do this numerically, by sampling each circle with 1,000 tracers that are distributed such that all of them cover the same area, and assign the contribution of each tracer to the pixel that contains its center.Although this procedure is approximate, in the limit that the number of tracers tends to infinity, one recovers the correct exact result of depositing circles into a 2D regular grid.We have checked that with 1,000 tracers our results are converged.We note that more accurate results can be obtained, if needed, from the 3D grids, whose constructions follows a very different procedure to the 2D maps (see section 2.4 for more details).All 2D maps are created at z = 0 and contain 256×256 pixels over an area of 25 × 25 (h −1 Mpc) 2 .For each field we generate 15,000 maps: 15 maps per simulation for 1,000 simulations.Each 2D map is described by either two or six numbers: two cosmological parameters (all maps) and four astrophysical parameters (only the maps generated from the IllustrisTNG and SIMBA simulations).
The reason why CMD only contains 2D maps at z = 0 is because it is possible to generate maps from the 3D grids (see 2.4), and also to allow users to fully reproduce the results obtained for the parameter inference task described below.For instance, one can take a slice of a 3D grid and project it into a 2D plane.From the 3D grids we can generate 2D maps with different widths, at different resolutions, and different redshifts.Furthermore, in some cases one may want to use 2D maps that partially overlap in the projected direction.We thus recommend readers to use the 3D grids to generate 2D maps that fulfill their needs.
We note that different fields from the same simulation may exhibit a very tight spatial correlation on large scales.This can be seen in Fig. 1, where we show an example of CMD maps of the IllustrisTNG simulations for all fields.In the online documentation we show examples of how to read and manipulate the files containing the 2D maps.We also describe there how to generate 2D maps from the 3D grids.

3D grids
We now describe the method we employ to generate the 3D grids and present their features.Table 1 shows the characteristics of the CMD 3D grids.
The 3D grids are constructed as follows.First, we read the positions and properties of the considered particles (see 2.2 for details).Next, we compute the radius of the considered particles as the distance to the 32nd closest gas and dark matter particle (in the case of gas and dark matter particles, respectively), or set it to zero for star and black hole particles.Next, we deposit the properties sampled by the spheres into a regular 3D grid with 128 3 , 256 3 , or 512 3 voxels using voxelize 12 , a code that calculates the integrals of section 2.2 in a precise way, making use of the overlap code (Strobl et al. 2016)   Table 2.Each 2D map and 3D grid of CMD is characterized by six numbers (two in the case of data from the N-body simulations).This table shows the range of variation and the distribution of each parameter.log uniform means that the distribution of the parameter is uniform when the logarithm of the value is taken.
We emphasize that the window function of the particles is given by Eq. 1 and is not the same for different particles.This means that the window function cannot be trivially removed from power spectra in order to enable direct comparison to theory predictions.However, for machine learning applications this is not an issue, and we believe that the choice of spherical kernels with physically motivated radii leads to smoother fields which may benefit training of convolutional neural networks.The 3D grids are constructed at redshifts 0, 0.5, 1, 1.5, and 2. For each simulation, field, and redshift, CMD provides 3 grids with 128 3 , 256 3 , and 512 3 voxels.We note that these three grids represent exactly the same field at different resolutions.All grids cover a comoving periodic volume of (25 h −1 Mpc) 3 .Different fields from the same simulation may exhibit a very tight spatial correlation on large scales, as in the case of the 2D maps.
In the online documentation we provide examples illustrating how to read the files containing the 3D fields and how to manipulate the data.

Labels
The 2D maps and 3D grids from the IllustrisTNG and SIMBA simulations are characterized by six numbers: two cosmological parameters (Ω m and σ 8 ), and four astrophysical parameters (A SN1 , A SN2 , A AGN1 , A AGN2 ).In the case of maps and grids from the N-body simulations, the labels are only Ω m and σ 8 .By construction, each parameter can vary within a wide range; many values are so extreme that the corresponding universes simulated in CAMELS are far from reality.This is however not a problem as we want our results to be unaffected by our priors (Villaescusa-Navarro et al. 2020).In Table 2 we show the range of variation of each parameter together with its distribution.
While the values of the cosmological parameters represent the same property in all simulations, 2D maps, and 3D grids, the same is not true for the astrophysi-cal parameters, whose definition and implementation is different in the IllustrisTNG and SIMBA simulations.Thus, if a neural network is trained to infer the value of Ω m and σ 8 from 2D maps of IllustrisTNG simulations, the same network can be tested on 2D maps from the SIMBA and N-body simulations to see if it is able to recover the values of those parameters.On the other hand, a network trained on, e.g., SIMBA 3D grids to infer the values of the astrophysical parameters, is not expected to work when tested in 3D grids from the IllustrisTNG simulations.
For cosmological applications, in general, the astrophysical parameters and their effects should be taken as nuisance parameters to be marginalized over.

Symmetries
It is important to know the symmetries of the CMD data to exploit them, or to enforce them, when using machine learning models or other methods.
The simulations are equivariant under rotations around integer multiples of π/2 (equivariance for other rotation angles holds approximately), parity, and translations.In order to implement the latter, periodic boundary conditions apply (not for cropped regions, of course).This information can be useful in some cases.For instance, instead of padding convolutional layers with zeros, one can use periodic padding for the same task, which may improve the performance of the model.

Disk space
We now briefly describe the disk space needed to store the different CMD elements.The property stored in every pixel of a 2D map and in every voxel of a 3D grid is a float that takes 4 bytes.Thus, a 2D map occupies 256 × 256 × 4 = 0.25 MB.In CMD, every field map contains 15,000 maps, so these files will require 3.7 GB per field.Since CMD has 27 different files for the 2D maps (13 fields for IllustrisTNG, 12 fields for SIMBA, and 1+1 field for N-body 14 ), storing all CMD 2D maps will require ∼ 100 GB.
The files hosting the 3D grids instead contain 1,000 grids per field, so each of those field files will occupy N 3 × 4 × 1000 bytes, i.e. 7.8 GB (N = 128), 62.5 GB (N = 256), 500 GB (N = 512).Hence, the 27 field files and three different resolutions combined will require 15 TB for all 3D grids at a single redshift.All CMD grids, at the 5 different redshifts require 75 TB.
We note that in the future we will generate more 2D maps and 3D grids at additional redshifts.The online documentation will always be updated to reflect any new data that is not described in this paper.

APPLICATIONS
In this section we outline a few possible tasks that can be carried out with CMD.The list of applications discussed below is far from comprehensive, but is rather just a subset of all possible applications that can be carried out with such a rich and complex dataset.

Parameter inference
One of the main applications of CMD is parameter inference: given a 2D map or a 3D grid, X, develop a method that predicts the posterior of the parameters where θ can be a single parameter, e.g.Ω m , or several or them such as We emphasize that due to cosmic variance, i.e. due to the finite volume covered in the simulations, no one-toone correspondence between the 2D maps or 3D grids and the values of the parameters exists.The inference can be carried out from a single field, or several fields can be used together in what is called a multifield.
Being able to extract cosmological information, at the field level, from 2D maps and 3D grids while marginalizing over astrophysical effects is one of the main goals of modern cosmology.CMD represents a state-of-the-art dataset that is optimized for this task.In our companion papers (Villaescusa-Navarro & et al. 2021a,b) we have made use of CMD to perform this task for the first time.Here, we now describe in detail the architecture and training procedure used in those works, and describe the problems we encountered as a challenge to the community.We note that in those works we only used the 2D maps, so all the details below apply to this data and not to the 3D grids.We also make publicly available all codes and network weights for this task at https://camels-multifield-dataset.readthedocs.io.
The way we carried out the inference in our companion papers (Villaescusa-Navarro & et al. 2021a,b) was to predict the mean and standard deviation of the marginal posterior for each parameter.Thus, given a 2D map, the network will output two numbers for each parameter.

Architecture
The architecture of our model consists of a set of convolutional layers (CNNs) followed by a fully connected layer.In detail: 45. Output: mean + std posterior (12 numbers) where C is the number of channels of the input map; for single fields C = 1 while this number is larger than one in the case of multifield.H is a hyper-parameter that controls the number of channels in the different CNNs; larger values will increase the number of channels and therefore make the network more complex.DR is the dropout rate, which is another hyper-parameter of the network.The notation CNN(K,S,P) indicates a CNN layer with kernel size K, strides S, and padding P. The input and output number of channels can be inferred from the scheme.E.g. the fourth CNN (step 8 above) takes as input 2H channels of images having 128 × 128 pixels and returns 4H channels of images that have 128× 128 pixels.All CNN layers use periodic padding.
FC(A,B) indicates a fully connected layer with A and B input and output values, respectively.The model has twelve outputs, corresponding to the mean and standard deviation of the marginal posterior for each of the six cosmological and astrophysical parameters.In the case of N-body maps, we set the number of output features to 4, since these maps are fully characterized by Ω m and σ 8 .
(24) In this notation, X represents a 2D map.Following the moments network work presented in Jeffrey et al. (2020), we define the loss function such that the output of the network converges to the above quantities: We note that this loss function differs from the original one presented in Jeffrey et al. (2020), .
(26) We replace the arithmetic sum by the sum of the logarithm of each term (either posterior mean or posterior standard deviation).Empirically, we have found that our modified loss function provides much tighter and reliable values for µ i and σ i than its original version.
The reason for this is that different parameters can be constrained more accurately than others by the network.For instance, there are fields where the impact of astrophysical processes is very mild, while they are very sensitive to cosmology (e.g. the dark matter field).In those cases, the loss function in Eq. 26 will be eventually dominated by the contribution from the astrophysical parameters, preventing the network from further improving constraints on the cosmological parameters.
By taking the logarithm of each term, before the sum, as in Eq. 25, we are effectively multiplying the losses of all terms instead of summing them, which prevents the problem mentioned above from occurring.A different way to see this is that when taking the logarithm of the terms, we are rescaling all terms to the same order of magnitude, and therefore the loss function will give a similar weight to all terms.

Training procedure
Each field contains 15,000 2D maps that we split into training (13,500 maps), validation (750 maps), and testing (750 maps) sets.This split is not done randomly from the maps themselves, but rather based on the simulations they were generated from, such that maps that have the same value of the cosmological and astrophysical parameters all belong to only one of these three sets.Thus, we take maps from 900, 50, and 50 simulations for training, validation, and testing, respectively, which amounts to the numbers of maps above given that there exist 15 maps for each simulation.We do this in this way to avoid hidden correlations between maps from the same simulation that we do not want the network to learn.
Our model has four hyper-parameters: 1) the learning rate (lr), 2) the weight decay (wd), 3) the number of channels in the CNNs (H), and 4) the dropout rate of the fully connected layers (DR).For a given value of the hyper-parameters, we train the above network by minimizing the loss function of Eq. 25 using gradient descent.We employ the AdamW optimizer (Loshchilov & Hutter 2017) with betas equal to 0.5 and 0.999. 15e train using a cyclicLR scheduler (Smith 2015) with a minimum learning rate of 10 −9 and a maximum equal to lr.We take 500 steps up and another 500 steps down to define the period of the scheduler.We use a batch size of 128 and train the model for 200 epochs.We save the weights of the model with the lowest validation loss.
The value of the hyper-parameters is optimized using the optuna package (Akiba et al. 2019).For each field, we consider at least 50 trials, where a trial represents the result of training the network with a given value of the hyper-parameters.Thus, for each field, we save the weights of at least 50 different models.
Optuna produces a database with the information of every trial, such as trial number, value of the hyperparameters, validation loss, etc.In the online documentation we explain how to read these databases together with the files containing the weights of the networks.With those files in hand, the model can be tested on either the test set or a new dataset.

Challenges
In our companion papers (Villaescusa-Navarro & et al. 2021a,b) we show that the above architecture allows us to place a few percent constraints on the value of the cosmological parameters from almost all of the 13 different fields.However, we also encountered a series of obstacles and leave some work for future exploration.
Here we outline what we believe are the most important challenges when doing parameter inference from CMD.
First, while our model is able to infer the value of the cosmological and astrophysical parameters with high accuracy for all the fields, the question remains of whether ours is the best model that can be constructed.It would be interesting to find models that perform better, i.e. that can constrain the values of the parameters with higher accuracy.Knowing these constraints will allow us to better understand a deep and crucial question in cosmology: How much information do non-linear Gaussian fields contain?The codes and weights released in this work can be taken as a benchmark to improve upon.
Second, we found that for some fields, e.g.gas temperature, if we train the model using IllustrisTNG maps and test it on SIMBA maps (and vice-versa), it is not able to recover the true value of the cosmological parameters.It is crucial for the model to be robust to the training data since models trained on simulated data may always face this problem, as simulations may never be perfect representations of reality.Thus, the second challenge is to find robust models that can be trained on a given simulation dataset and will be able to perform inference when testing on a different dataset.
Third, in our companion paper (Villaescusa-Navarro & et al. 2021a) we found that when doing parameter inference over 2D maps with different fields as different channels (multifield ) the constraints on the parameters improve with respect to a single field.However, the improvement is relatively modest, i.e. we did not observe a major improvement.Thus, it will be interesting to find the minimum set of fields of a multifield that contain, e.g., 90% and 95% of the cosmological and possibly astrophysical information.In other words, if using the 13 different fields allows us to place a given constraint on the value of the cosmological and astrophysical parameters, what minimum subset of these fields enables us to get a fraction of those constraints?The idea behind this exercise is that those subsets of fields can be used to infer the value of the cosmological and astrophysical parameters, and the rest of the fields can be used to perform internal cross-validations.For instance, if the gas temperature field is not used to infer the parameter values, it still can be used when running simulations with the most likely parameter values to compare directly against observations, providing an additional internal check to verify the robustness of the model.
Fourth, it would be very important to understand where the information from the different fields is coming from.In other words, what summary statistic, if any, are the CNNs using in order to constrain the value of the cosmological parameters?Are they focusing their attention into regions with large values of the considered field?Shedding light into this question will help to develop analytic methods to more robustly extract that information but also help us in understanding the process of non-linear gravitational evolution.
Of course, the above challenges apply to both 2D maps and 3D grids.

Generative models
While CMD data spans a wide range in the values of cosmological and astrophysical parameters, there may be some applications where more data is needed at points in parameter space not covered by CMD data.In this case, one can use techniques such as conditional GANs or conditional normalizing flows to generate new 2D maps or 3D grids conditioned on the values of the parameters (Tamosiunas et al. 2021).These emulators at the field level can be used in place of the more expensive simulations to carry out different tasks.

N-body to hydrodynamic
(Magneto-)hydrodynamic simulations are much more computationally expensive than gravity-only N-body simulations.At the time of writing this paper, running full hydrodynamic simulations over gigaparsec volumes with a reasonable resolution is computationally unfeasible.On the other hand, the N-body counterparts of these simulations are presently beginning to be feasible with advanced supercomputers.Thus, it may be desirable to paint gas and star properties onto the dark matter field from N-body simulations (i.e. to transform dark matter into gas and star properties), as in Wadekar et al. ( 2021 Alternatively, the total matter field from the N-body simulations can be mapped to the total matter of the full hydrodynamic simulation.This will be necessary for creating weak lensing maps that incorporate astrophysics effects at the field level. Since the CMD spans a large volume in parameter space, the above mapping(s) can be done conditionally on the values of the cosmological and astrophysical parameters.Furthermore, the mapping can be done for several fields at the same time in order to take into account all cross-correlations among the fields.

Super-resolution
CMD provides 3D grids at three different resolutions for 13 different fields.It would be very interesting to train models that can take as input the low-resolution map or grid from a given field and output a higher resolution version of it.We emphasize that the three different grids for each redshift and field provided by CMD arise from the same simulation, i.e. the mass and spatial resolution of the underlying simulation is the same.
Ideally, one would like to run a low-resolution hydrodynamic simulation and use a model that can produce a higher resolution version of it.This will be extremely valuable for the astrophysics community as the computational cost quickly increases with mass and spatial resolution.While this task has been carried out for gravityonly simulations (Kodi Ramanah et al. 2020;Li et al. 2021;Ni et al. 2021), it still remains to be performed for cosmological hydrodynamic simulations.While the only difference between the three different CMD 3D grids is in the resolution of the mesh size rather than the resolution of the underlying simulation, developing superresolution methods for fields from cosmological hydrodynamical simulations using CMD has the potential to contribute towards the development of such methods for simulations of different resolutions.

Time evolution
With enough disk space, it is possible to save as many snapshots of a simulation as desired.In practice, the number of snapshots generated is limited due to disk space constraints.It would be very valuable to have a model that given a set of snapshots at some redshifts could output snapshots at other redshifts.This will be valuable for understanding the time evolution of some phenomena and for constructing merger trees and lightcones from the simulations.Chen et al. (2020) showed that this is possible for gravity-only simulations.CMD provides a rich dataset to train models to carry out this task for many different fields of hydrodynamic simulations.

SUMMARY
In this paper we have introduced the CAMELS Multifield Dataset, CMD, a large cosmological and astrophysical dataset that contains hundreds of thousands of 2D maps and 3D grids of 13 different fields: 1) gas mass, 2) gas velocity, 3) gas temperature, 4) gas pressure, 5) gas metallicity, 6) neutral hydrogen mass, 7) electron density, 8) magnetic fields, 9) magnesium over iron ratio, 10) dark matter mass, 11) dark matter velocity, 12) stellar mass, and 13) total matter mass.CMD has been created from simulations of the CAMELS project (Villaescusa-Navarro et al. 2021), a collection of more than 4,000 gravity-only N-body simulations and state-of-the-art hydrodynamic simulations from thousands of different universes.Each 2D map and 3D grid is described by two cosmological and four astrophysical parameters (only in the case of the hydrodynamic simulations).
We have described a few applications of CMD: 1) parameter inference, 2) generative models, 3) mapping N-body to hydrodynamic, 4) super-resolution, and 5) time evolution.In our companion papers (Villaescusa-Navarro & et al. 2021a,b) we used CMD to show that neural networks can extract information from the field while marginalizing over astrophysical effects for all CMD fields.In this paper we have described in detail the architecture of our model together with the training procedure.
We release all CMD data (over 70 Terabytes), together with the codes and network weights for the parameter inference task carried out in our companion papers (Villaescusa-Navarro & et al. 2021a,b).We provide further technical details on how to download, read, and manipulate the data in https://camels-multifield-dataset.readthedocs.io.We hope that CMD can become a standard dataset for machine learning applications in cosmology and astrophysics.
In the future, we will create maps for directly observable quantities from weak lensing, thermal and kinetic Sunyaev-Zeldovich effects, and X-ray and 21 cm emission.Future public data releases from the CAMELS simulation suite will also include a database specifically targeted at multi-wavelength probes of circumgalactic medium profiles.

Figure 1 .
Figure 1.We show 10 examples of 2D maps for each of the 13 different fields present in the IllustrisTNG simulations.Each image has a different value of the cosmological parameters, Ωm and σ8, and astrophysical parameters ASN1, ASN2, AAGN1, AAGN2.The different columns represent the same region for the different fields.Each image has a physical size of 25 × 25 (h −1 Mpc) 2 .