Sulstice2

Sulstice2 OP t1_iy2e0kf wrote

Mobile Friendly Demo:

https://sulstice.github.io/Faith/enamine_database/index.html

Motivation:

I am preparing for PyData Global 2022 and want to present the utility of software we have in house called the CHARMM General Force Field (CGenFF), which can explain a molecule's features by giving each atom an atom type. To demonstrate what new software can do with Python, we can process the Enamine Database of 22 billion compounds through our pipeline to generate a massive set of chemical space.

My pipelines have started running and processed the first 1,000,000 compounds. I made a map key from the atom type language to a chemist's layman's terms to help describe the different atoms. I am still mapping out that dictionary to make it more robust.
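A minimal sketch of what such a map key might look like in Python. The two entries reuse the descriptions given for HGA1 and HGR62 elsewhere in this thread; the dictionary name, the helper function, and the fallback label are assumptions for illustration, not the actual in-house mapping.

```python
# Hypothetical map key from atom types to plain-language labels.
# Only two entries are filled in, using the descriptions from this thread.
ATOM_TYPE_LABELS = {
    "HGA1": "methyl hydrogen",
    "HGR62": "benzene hydrogen",
}


def describe(atom_type: str) -> str:
    """Return a plain-language label, or flag the type as not yet mapped."""
    return ATOM_TYPE_LABELS.get(atom_type, f"unmapped ({atom_type})")


print(describe("HGA1"))
print(describe("CG331"))  # not in the toy dictionary, so it is flagged
```

Flagging unmapped types instead of raising an error makes it easy to see which parts of the dictionary still need work.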

In the GIF above, you can see which types of atoms show up more often based on the thickness of the lines and their connections to others. Some atoms are more diverse and some only bond to one type. Alkynes are rare compared to others, but bridged systems seem very common to me, about as common as aliphatics.

Software:

I had to use C++ for the force field to process the Enamine DB, Python for data processing and transformation, and d3 for the visualization. I tried something different in setting the amount of curvature for the arcs between connections, and I could start to create this ball in the middle, like a flower.

Here is the Data:

https://github.com/Sulstice/Faith/blob/main/enamine_database/atom_type_group_new.json

I wonder what will change as I sample more data and what becomes common.

12

Sulstice2 OP t1_ix8eivb wrote

Molecules interact with each other, and they prefer to be in the specific orientation or geometry that is most energetically favorable: whatever takes the least amount of work. The equation helps us determine that by separating the energy into different components based on physical and electronic characteristics.
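To make "separating the energy into components" concrete, here is a rough Python sketch of three typical force-field terms: a harmonic bond stretch, a Lennard-Jones term, and a Coulomb term. The functional forms are the standard textbook ones; every parameter value and function name below is made up for illustration and is not taken from CGenFF.

```python
# All parameter values in this sketch are invented for illustration;
# a real force field tabulates them per atom type.
COULOMB_CONSTANT = 332.0716  # kcal*Angstrom/(mol*e^2), a common MD convention


def bond_energy(b, k_b, b0):
    """Harmonic bond stretch: K_b * (b - b0)^2."""
    return k_b * (b - b0) ** 2


def lennard_jones(r, epsilon, r_min):
    """Van der Waals term: eps * [(Rmin/r)^12 - 2 * (Rmin/r)^6]."""
    ratio6 = (r_min / r) ** 6
    return epsilon * (ratio6 ** 2 - 2.0 * ratio6)


def coulomb(r, q_i, q_j):
    """Electrostatic term between two point charges at distance r."""
    return COULOMB_CONSTANT * q_i * q_j / r


# Scan separations for a neutral pair and pick the lowest-energy one.
distances = [3.0 + 0.01 * i for i in range(301)]
best_r = min(distances, key=lambda r: lennard_jones(r, epsilon=0.1, r_min=4.0))
print(f"lowest-energy separation: {best_r:.2f} Angstrom")
```

With no charges involved, the scan should settle at the `r_min` passed in, which is the "most energetically favorable" distance for that pair in this toy setup.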

In a force field, we start off small with simple molecular systems and then apply it to larger systems, predicting how atoms will move based on their energy.

So, for example, the energy interactions and orientations we use for simple alcohols or carboxylic acids can be applied to simulating lipid membranes.

Does that make sense?

1

Sulstice2 OP t1_ix664u0 wrote

Hi Josh,

That's actually a really good idea and I think that would help a lot. I actually mapped out the atom names to something like that already so this would be something I can prepare in my next round before the bigger talk.

Any more suggestions I will gladly accept. With data visualization, I really want to get this information out to the public in the most efficient manner, and it's been a bit of a struggle.

Yeah, it took me a while to record all the chemicals. About 2-3 years.

32

Sulstice2 OP t1_ix5t0l5 wrote

There's a belief that the CHARMM equation, on which this language (the nodes) is built, is the equation for simulating life.

I also believe that in the sea of chemical data we can filter data based on how common or useful it is to a particular community.

By connecting the two we can start to map out atom types of relevance to people. We can predict new chemical space based on their atom types.

So let's say we want to predict a new sunscreen that doesn't harm the environment. We can use these relations to predict something better by knowing the features of a molecule.

The pseudo part is the belief that it will work.

4

Sulstice2 OP t1_ix5hq64 wrote

Hello,

Website & Mobile Friendly: https://sulstice.github.io/Faith/global_chem/index.html

I sampled the most commonly recorded chemicals across different sub-communities to understand which atoms are most common and which pairs of atoms are most common together. Different communities means different classes of chemicals (cannabis, things used in sex products, toxic agents used in war, food colour additives, materials, cosmetics, birth control, etc.).

https://github.com/Sulstice/global-chem/blob/development/global_chem/GlobalChem_Dictionary%20(1).pdf

In the chord diagram above, each node is an atom type that exists within the dataset and each link is a bond between two atom types. The thickness of the line correlates with how often those particular atom types occur together. Pink shows how often two different hydrogen types occur together, and blue shows a hydrogen and a carbon. The rest of the plot is colored light grey.
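As a rough sketch of how the link thickness could be computed, the Python below counts how often each unordered pair of atom types appears across bonds. The input layout (a list of atom types plus index pairs for bonds) and the two toy molecules are hand-written assumptions for illustration, not output from the actual pipeline.

```python
from collections import Counter

# Toy input, hand-written for illustration: each molecule is a list of
# per-atom types and a list of bonds given as index pairs into that list.
molecules = [
    (["CG331", "HGA3", "HGA3", "HGA3"], [(0, 1), (0, 2), (0, 3)]),
    (["CG2R61", "HGR61", "CG2R61"], [(0, 1), (0, 2)]),
]


def count_bonded_pairs(molecules):
    """Count bonds per unordered pair of atom types."""
    counts = Counter()
    for atom_types, bonds in molecules:
        for i, j in bonds:
            # Sort so that (A, B) and (B, A) land in the same bucket.
            pair = tuple(sorted((atom_types[i], atom_types[j])))
            counts[pair] += 1
    return counts


pair_counts = count_bonded_pairs(molecules)
for pair, n in pair_counts.most_common():
    print(pair, n)
```

Each count would then set the width of the chord between the two atom-type nodes.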

Next, I passed them through something called the CHARMM force field, which has a language where you can declare different types of atoms, like an alkane versus an aromatic. In the plot I am highlighting HGA1 and HGR62; these are methyl hydrogens and benzene hydrogens in our language.

That data is available here, feel free to play around with it:

https://raw.githubusercontent.com/Sulstice/Faith/main/global_chem/atom_type_group_new.json
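If you want a starting point in Python, the sketch below downloads the file and reports its top-level shape. It deliberately assumes nothing about the JSON schema; the function names are just illustrative.

```python
import json
from urllib.request import urlopen

DATA_URL = (
    "https://raw.githubusercontent.com/Sulstice/Faith/main/"
    "global_chem/atom_type_group_new.json"
)


def load_json(url=DATA_URL):
    """Download and parse the JSON file (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


def summarize(data, max_items=5):
    """Describe the top-level shape without assuming a schema."""
    if isinstance(data, dict):
        return {"type": "dict", "size": len(data),
                "sample_keys": list(data)[:max_items]}
    if isinstance(data, list):
        return {"type": "list", "size": len(data),
                "sample_items": data[:max_items]}
    return {"type": type(data).__name__, "value": data}
```

Calling `summarize(load_json())` should show whether the file is a dictionary or a list and give a small sample to explore from.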

Still a work in progress as I get it ready for PyData Global. I think there are some bugs. The code is here:

https://github.com/Sulstice/Faith/blob/main/global_chem/index.html

6