trnka

trnka t1_j3t1t18 wrote

If you're doing the preprocessing and feature selection manually (meaning without the use of a library), yeah that's a pain.

If you're using sklearn, you should generally be fine as long as you do all your preprocessing and feature selection with their classes inside a sklearn Pipeline. For example, if your input data is a pandas dataframe, you can use a ColumnTransformer to tell it which columns to preprocess in which ways, such as applying a OneHotEncoder to the categorical columns. Then you can follow that up with feature selection before your model.

Sklearn's classes are implemented so that they only train the preprocessing and feature selection on the training data.
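A minimal sketch of that setup; the column names, `k`, and the classifier are placeholders, not anything specific to your data:

```python
# Rough sketch, not a drop-in solution: the column names, k, and the
# classifier are placeholders.
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    # one-hot encode categorical columns, scale numeric columns
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["color", "state"]),
    ("numeric", StandardScaler(), ["age", "income"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(f_classif, k=5)),
    ("model", LogisticRegression(max_iter=1000)),
])

# pipeline.fit(X_train, y_train) fits the preprocessing, feature selection,
# and model on the training data only; pipeline.predict(X_test) reuses the
# fitted transforms on new data without refitting them.
```

Because everything lives inside the pipeline, cross-validation refits the preprocessing and feature selection inside each fold, so nothing leaks from the held-out data.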

1

trnka t1_j3ldgc9 wrote

Microsoft has a good checklist to consider if you haven't seen it.

There are many publications on fairness nowadays, so I'd also suggest reading a few of the survey papers that have a good number of citations.

I'm pretty sure there are many workshops and conferences on fairness in AI nowadays too that would be good for ideas. There are even ML toolkits to help detect or reduce bias these days, so those would be good to search for.

Hope this helps! Fairness has become a pretty big area over the last several years.

1

trnka t1_j3i3vk4 wrote

If your input is only ever a single word, that's right.

Usually people work with texts, or sequences of words. The embedding layer maps the sequence of words to a sequence of embedding vectors. It could be implemented as a sequence of one-hot encodings multiplied by the same W though.
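A quick sketch of that equivalence, assuming PyTorch (the vocabulary and embedding sizes here are arbitrary):

```python
# An embedding lookup is the same as multiplying one-hot vectors by a shared
# weight matrix W; the lookup is just the faster way to compute it.
import torch
import torch.nn.functional as F

vocab_size, embed_dim = 10, 4
embedding = torch.nn.Embedding(vocab_size, embed_dim)

word_ids = torch.tensor([[2, 5, 7]])  # a batch with one 3-word sequence

# Standard lookup: (batch, seq_len) -> (batch, seq_len, embed_dim)
lookup = embedding(word_ids)

# Same result via one-hot vectors multiplied by the shared W
one_hot = F.one_hot(word_ids, num_classes=vocab_size).float()
matmul = one_hot @ embedding.weight

assert torch.allclose(lookup, matmul)
```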

2

trnka t1_j3i2zxx wrote

You don't need to choose, and there's definitely a market for people that are capable of both good software engineering and good machine learning. Personally I'm a big believer in being well-rounded in terms of skills.

If I had to guess, what you're saying might just mean that you have more to learn about software engineering than machine learning right now. And that'll change over time.

1

trnka t1_j3g6cax wrote

It depends on what you want to do:

  • If you just want to apply NER, I'd recommend Spacy because it's fast and they have pretrained models for many languages (there's a quick example after this list).
  • If you're looking to fine-tune or train your own NER, use either Spacy or Huggingface to work with BERT-style models.
  • If you're looking to build your own neural network architecture for NER, PyTorch is most popular.
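A minimal example of the first option, assuming the small English model has already been downloaded:

```python
# Quick sketch of applying a pretrained spaCy NER model; assumes you've run
# `python -m spacy download en_core_web_sm` first.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Seattle next year.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Apple ORG", "Seattle GPE"
```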
1

trnka t1_j3g5uer wrote

You're right that it's just a matrix multiply of a one-hot encoding. Representing it as an embedding layer is just faster, since it's a lookup instead of a full matrix multiply.

I wouldn't call it a fully-connected layer though. In a fully-connected layer, the input to the matrix multiply is the output of everything in the previous layer, not just the output of a single unit, and the weights that multiply the output(s) of one unit are not the same weights that multiply the outputs of any other unit. An embedding layer, in contrast, applies the same W at every position.

It's more like a length-1 convolution that projects the one-hot vocab down to the embedding space.
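A rough sketch of that length-1 convolution view, assuming PyTorch:

```python
# A kernel-size-1 Conv1d over one-hot "channels" projects the vocab dimension
# down to the embedding dimension, applying the same weights at every position.
import torch
import torch.nn.functional as F

vocab_size, embed_dim = 10, 4
word_ids = torch.tensor([[2, 5, 7]])                        # (batch, seq_len)
one_hot = F.one_hot(word_ids, vocab_size).float()           # (batch, seq_len, vocab)

conv = torch.nn.Conv1d(vocab_size, embed_dim, kernel_size=1, bias=False)

# Conv1d expects (batch, channels, seq_len), so the vocab acts as channels.
projected = conv(one_hot.transpose(1, 2)).transpose(1, 2)   # (batch, seq_len, embed_dim)

# The conv weight, shape (embed_dim, vocab_size, 1), plays the role of W.
assert projected.shape == (1, 3, embed_dim)
```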

3

trnka t1_j3g5f4a wrote

Yeah that's pretty common. If you'd like to do more machine learning, as your team and company grows you might try asking your boss to hire more SDEs so that you can spend more time with machine learning. Or alternatively, ask for more training so that the backend engineering goes more quickly.

As for "keeping up with the field", I don't recommend worrying about it. It's challenging, maybe impossible, to actually stay up to date on everything even if it's only ML. I find it's better to make a habit of learning something every day, however small, and focus on the growth aspect rather than some sense of "falling behind".

1

trnka t1_j2d4wt7 wrote

There must be a name for this but I don't know it. It's a common problem when merging data sources.

If you have a good amount of data on existing mappings, you could learn to predict that mapping for each input field. The simplest thing that comes to mind is to use character ngrams of the source field name and predict the correct target field name (or predict that there's no match).
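A hedged sketch of that character-ngram idea with sklearn; the field names and labels here are made up:

```python
# Toy example: predict the canonical target field from the source field name
# using character ngrams. Real data would need many more examples and a
# "no match" label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

source_fields = ["cust_name", "customer_nm", "fullName",
                 "zip", "postal_cd", "zipcode"]
target_fields = ["customer_name", "customer_name", "customer_name",
                 "postal_code", "postal_code", "postal_code"]

mapper = Pipeline([
    ("ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("clf", LogisticRegression()),
])
mapper.fit(source_fields, target_fields)

print(mapper.predict(["customer_full_name", "postalCode"]))
```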

If you also have a sample of data from the customer, you could use properties of the data in each field as input as well -- the data type, range of numeric values, ngrams for string fields, string length properties, etc.

As for the business problem, even with automated mapping you probably need to force customers to review and correct the mappings or else you might end up with complaints from customers that didn't review.

All this isn't quite my area of expertise -- hope this helps!

1

trnka t1_j285nir wrote

Nice to see other people doing XAI in healthcare! Here are some of the ideas I would've loved to explore; no clue if they'd work out, but they might be fun:

  • Extract features from dermatology images to be used in interpretable models -- Things like whether it's a bullseye, raised, etc. Those features could be used in an interpretable diagnosis model, potentially even a rule-based one made by doctors. Then it could be used to generate part of the clinical note. One of my coworkers did a hackathon project on it and I think it's got potential. I have no idea how to design it to be robust against errors in the feature extractor though, like if it hallucinates a bullseye.
  • I'm curious about using NLG models for explanation. Maybe one of the GPTs on Huggingface with some prompt work could do it -- something like "We're going to explain the choice of diagnosis in medical cases. Features of patient 1: ... Explanation in favor of sinusitis: ... Features of patient N: ... Explanation in favor of covid:" It wouldn't be a real explanation, but it might be more usable than existing options, especially if there's a lot of textual data from the patient (there's a rough sketch after this list).
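A rough sketch of that second idea using the Huggingface text-generation pipeline; the model choice and the case details are placeholders, not anything clinically vetted:

```python
# Few-shot prompt sketch: ask a generative model to produce an explanation
# given patient features. gpt2 is only a stand-in for a stronger model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "We're going to explain the choice of diagnosis in medical cases.\n"
    "Features of patient 1: fever, facial pressure, symptoms for 12 days.\n"
    "Explanation in favor of sinusitis: prolonged symptoms with facial "
    "pressure suggest a bacterial sinus infection.\n"
    "Features of patient 2: fever, dry cough, loss of smell.\n"
    "Explanation in favor of covid:"
)

print(generator(prompt, max_new_tokens=60)[0]["generated_text"])
```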

There was an interesting panel debate about XAI at ML4H this year. I don't think the recordings are online but the papers are online at least. Mihaela van der Schaar brought up some interesting work too, such as learning interpretable equations for healthcare.

3

trnka t1_j23ajbf wrote

I'd recommend starting with the [Andrew Ng Coursera specialization](https://www.coursera.org/specializations/machine-learning-introduction#courses). It's free and will give you a good base to build upon. I feel like he explains concepts very well and is good about explaining terminology.

> What would it take to be able to learn to train a model and deploy in-house solutions? For my company, I'd like to take our knowledge base and SOP and turn it into an interactive guide you can ask questions.

If the SOP is fairly short you can add it to your ChatGPT prompt and it can do Q&A from that. I found [Learn Prompting](https://learnprompting.org/docs/intro) helpful to understand how to do this.

I'm not sure about the knowledge base but it might be possible to inject that as knowledge too. The challenge is that there's a max input length.
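As a rough illustration of stuffing the SOP into the prompt (this assumes the openai Python client; the model name and SOP text are placeholders):

```python
# Hedged sketch, not production code: long documents will hit the model's
# input length limit, and the model/SOP here are made up.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

sop_text = "1. Greet the customer. 2. Verify the account. 3. ..."  # your SOP here

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Answer questions using only this SOP:\n" + sop_text},
        {"role": "user", "content": "What do I need to do before discussing an account?"},
    ],
)
print(response.choices[0].message.content)
```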

But let's take a step back for a moment -- in general it's not too hard to learn ML basics and be able to build some model. Like it might take a few weekends depending on your schedule and previous experience with programming and math. If you want to solve a question answering problem, how much you need to learn will depend a great deal on how well you need it to work. For instance, you could probably get by with a simple search system for many things but it might not meet your bar for quality.

> I apologize if machine learning isn't the same thing as A.I.

I think of AI as the broader term and generally I think about the [AIMA table of contents](http://aima.cs.berkeley.edu/contents.html) for the general scope of AI -- machine learning is in there but there's a lot of other stuff too like logic, planning, ontologies, and optimization problems. That said, in the news AI is often used to mean "any technology that seems magical" and that's problematic because things like chess bots seemed magical in the past but no longer seem magical. So the scope of the term has shifted over time.

1

trnka t1_j1hewwg wrote

It's ok but not great. If, say, your model doesn't always converge, that would be one way to deal with it.

I'd prefer to see someone tune hyperparameters so that the metrics are minimally sensitive to the random seed, though.

1

trnka t1_j1heqwa wrote

In actual product work, it's rarely sufficient to look at a single metric. If I'm doing classification, I typically check accuracy, balanced accuracy, and the confusion matrix for the quality of the model among other things. Other factors like interpretability/explainability, RAM, and latency also play into whether I can actually use a different model, and those will depend on the use case as well.
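For the classification case, a quick sketch of those checks with sklearn (the labels here are made up):

```python
# Accuracy, balanced accuracy, and the confusion matrix on held-out data.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

y_true = [0, 0, 0, 1, 1, 2]  # held-out labels
y_pred = [0, 1, 0, 1, 0, 2]  # model predictions

print("accuracy:", accuracy_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```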

I would never feel personally comfortable deploying a model if I haven't reviewed a sample of typical errors. But many people deploy models without that and just rely on metrics. In that case it's more important to get your top-level metric right, or to get product-level metrics right and inspect any trends in, say, user churn.

> Do you see qualitative improvements of models as more or less important in comparison to quantitative?

I generally view quantitative metrics as more important though I think I value qualitative feedback much more than others in the industry. For the example of bias, I'd say that if it's valued by your employer there should be a metric for it. Not that I like having metrics for everything, but having a metric will force you to be specific about what it means.

I'll also acknowledge that there are many qualitative perspectives on quality that don't have metrics *yet*.

> do you ever read papers that just improved SOTA without introducing significant novel ideas?

In my opinion, yes. If your question was why I read them, it's because I don't know whether they contribute useful, new ideas until after I've read the paper.

Hope this helps - I'm not certain that I understood all of the question, but let me know if I missed anything.

2

trnka t1_j1hdb1o wrote

For predicting mortality I'd suggest looking into survival analysis. The challenge with mortality data is that you don't know when everyone will die, only the deaths that have happened so far; this is called censoring. So to work with the data, the problem gets reframed as "predict whether patient P will be alive D days after their operation."

A quick Google suggests that 90-day mortality is a common metric so I'd suggest starting there. For each patient you'd want to record mortality at 90-days as alive/dead/unknown. From there you could use traditional machine learning methods.

If the time points are standardized across patients you could use them like regular features, for instance feature1_at_day1, feature1_at_day2, ... If they aren't standardized across patients you need to get them into the same representation first. I'd suggest starting simple, maybe something like feature1_week1_avg, feature1_week2_avg, and so on. If you want to get fancier about using the trend of the measurement as input, you could fit a curve to each feature for each patient over time and use the parameters of the curve as inputs. Say if you fit a linear equation, y = mx + b, where x = time since operation and y = the measurement you care about. In that case you would fit m & b and then use those as inputs to your model. (All that said, definitely start simple)

The biggest challenge I'd expect is that you probably don't have a lot of mortality so machine learning is likely to overfit. For dealing with that I'd suggest starting very, very simple like regularized logistic regression to predict 90-day mortality. Keep in mind that adding features may not help you if you don't have much mortality to learn from.
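A hedged sketch tying those pieces together -- a fitted line per measurement as trend features, then a regularized logistic regression on 90-day mortality; all the numbers here are made up:

```python
# Toy example: slope/intercept trend features per patient, then a strongly
# regularized logistic regression for 90-day mortality.
import numpy as np
from sklearn.linear_model import LogisticRegression

def trend_features(times, values):
    """Fit y = m*x + b to one measurement over time; return (m, b)."""
    m, b = np.polyfit(times, values, deg=1)
    return m, b

# One row per patient: e.g. the trend of a lab value over days since operation
X = np.array([
    trend_features([1, 7, 14], [1.0, 1.2, 1.5]),
    trend_features([2, 9, 13], [0.9, 0.9, 1.0]),
    trend_features([1, 5, 12], [1.4, 1.8, 2.3]),
    trend_features([3, 8, 15], [1.0, 1.1, 1.0]),
])
y = np.array([1, 0, 1, 0])  # 1 = died within 90 days, 0 = alive at 90 days

# Small C = strong L2 regularization, to limit overfitting on rare outcomes
model = LogisticRegression(C=0.1).fit(X, y)
print(model.predict_proba(X)[:, 1])
```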

Hope this helps! I've worked in medical machine learning for years and done some survival analysis but not much. We were in primary care so there was very little mortality to deal with.

1

trnka t1_j1hcd0f wrote

Adding a practical example:

I worked on SDKs for mobile phone keyboards on Android devices. The phone manufacturers at the time didn't let us download language data so it needed to ship on the phones out of the box. One of the big parts of each language's data was the ngram model. Quantization allowed us to save the language model probabilities with less precision and we were able to shrink them down with minimal impact on the quality of the language model. That extra space allowed us to ship more languages and/or ship models with higher quality in the same space.
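A toy sketch of that idea -- storing log-probabilities in 8 bits with a generic linear scale/offset scheme (not the exact format we shipped):

```python
# Quantize 32-bit log-probabilities down to uint8: roughly 4x smaller,
# with a small, bounded rounding error.
import numpy as np

log_probs = np.log(np.random.dirichlet(np.ones(1000))).astype(np.float32)

lo, hi = log_probs.min(), log_probs.max()
quantized = np.round((log_probs - lo) / (hi - lo) * 255).astype(np.uint8)
restored = quantized.astype(np.float32) / 255 * (hi - lo) + lo

print("max absolute error:", np.abs(restored - log_probs).max())
```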

1

trnka t1_j1hbzj4 wrote

Not all positions will be open to non-traditional backgrounds, but many are. It looks to me like bigger or older companies often look for a "traditional" resume for ML, while smaller/younger companies tend to be more open-minded and focus on testing actual skills.

I've worked with a surprising number of physics PhDs in the ML space, at least in the US, so I don't see your background as a problem.

I know one of the physics PhDs did a coding bootcamp to transition into industry and that helped a lot.

1

trnka t1_j0hr6zn wrote

Yeah our doctors spent a lot of time building and revising clinical guidelines for our practice.

I'm not sure what your background is, but some tips from working with them on clinical guidelines:

  • There were some guidelines that were generally-accepted best practices in medicine, but it was more common to have clinic-specific guidelines
  • My team ran into some resistance to the idea of ML-created guidelines. Physicians were more receptive to technology that assisted them in creating guidelines
  • Many guidelines are aspirational, like when to order lab tests. Many patients just won't get the tests, or the test results will come back after the current condition has resolved. Likewise, if you're worried about a patient taking the antibiotic to term, it may be better to use a 1-dose second-line antibiotic rather than a multi-dose first-line antibiotic. In the long term I expect that clinical guidelines will adapt somewhat to patient adherence; they aren't a one-time thing. Plus research changes too.
  • For any evidence, there needs to be vetting of how it's gathered, like whether it's a proper randomized control trial, how the statistics are done, how the study is designed, what population was studied, etc
5

trnka t1_j0emdr8 wrote

Very cool! I worked closely with our doctors on ML features at a telemedicine startup, so let me check some of the things I know about:

- What's the most effective antibiotic for a UTI? -> "effective" was a poor word choice on my part, it gave 1 drug that we didn't use, 1 that was a last line of defense type of drug, and then a drug class

- What's the best first-line antibiotic for a UTI -> agreed with our clinical best practices (Nitrofurantoin)

- I tried asking when to diagnose a common cold as bacterial vs viral if a lab test can't be done - no results (our best practice was to treat it as bacterial if symptoms aren't improving after more than 10 days)

- "When is tamiflu effective?" If I remember right, our guidelines were first day of infection or first two days if the patient's immunocompromised, lives with someone immunocompromised, or works in healthcare. The system was sorta right: "Oseltamivir (Tamiflu) is effective for the treatment and prevention of influenza in adults, adolescents, and children, and early initiation of treatment provides greater clinical benefits."

- How does coffee affect blood pressure? I remembered it increased after drinking in a BP test we ran. That showed up in the results, but the results had both 1) studies about immediate effects and 2) studies about long-term effects which have different conclusions.

When a query fails, I wish it wouldn't just delete it - it'd be nice to have it still there so I can share it with you, or revise the query myself. I also had one query get "stuck" even though all the steps showed as checked.

If you can get access to a site like UpToDate, many of our doctors used that for clinical best practices. Very few searched the medical literature directly.

I'll share with some of my doctor friends and hopefully they'll give you actual medical feedback rather than the secondhand knowledge I have.

49

trnka t1_iz9ol1s wrote

> one record close to another in x, y, z will likely have a similar outcome

That sounds a lot like k-nearest neighbors, or SVM with RBF kernel. Might be worth giving those a shot. That said, xgboost is effective on a wide range of problems so I wouldn't be surprised if it's tough to beat. Under the hood I'm sure it's learning approximated bounding boxes for your classes.
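If you want to try those quickly, here's a rough sketch with sklearn on made-up x, y, z coordinates:

```python
# Compare k-nearest neighbors and an RBF-kernel SVM with cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # x, y, z coordinates
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)    # toy spatial label

for name, model in [("knn", KNeighborsClassifier(n_neighbors=5)),
                    ("svm-rbf", SVC(kernel="rbf"))]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```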

I haven't heard of CNNs being used for this kind of problem. I've more seen CNNs for spatial processing when the data is represented differently, for example if each input were a 3d shape represented by a 3d tensor rather than coordinates.

2

trnka t1_iz9j30k wrote

It definitely depends on sector/industry and also the use case for the model. For example, if you're building a machine learning model that might influence medical decisions, you should put more time into validating it before anyone uses it. And even then, you need to proceed very cautiously in rolling it out and be ready to roll it back.

If it's a model for a small-scale shopping recommendation system, the harm from launching a bad model is probably much lower, especially if you can revert a bad model quickly.

To answer the question about the importance of validating, importance is relative to all the other things you could be doing. It's also about dealing with the unknown -- you don't really know if additional effort in validation will uncover any new issues. I generally like to list out all the different risks of the model, feature, and product. And try to guesstimate the amount of risk to the user, the business, the team, and myself. And then I list out a few things I could do to reduce risk in those areas, then pick work that I think is a good cost-benefit tradeoff.

There's also a spectrum regarding how much people plan ahead in projects:

  • Planning-heavy: Spend months anticipating every possible failure that could happen. Sometimes people call this waterfall.
  • Planning-light: Just ship something, see what the feedback is, and fix it. The emphasis here is on a rapid iteration cycle from feedback rather than planning ahead. Sometimes people call this agile, sometimes people say "move fast and break things"

Planning-heavy workflows often waste time on unneeded things and are slow to act on user feedback. Planning-light workflows often make mistakes on their first version that were knowable, and can sometimes permanently lose user trust. I tend to lean planning-light, but there is definite value in doing some planning upfront so long as it's aligned with the users and the business.

In your case, it's a spectrum of how much you test ahead of time vs monitor. Depending on your industry, you can save effort by doing a little of both rather than a lot of either.

I can't really tell you whether you're spending too much time in validation or too little, but hopefully this helps give you some ideas of how you can answer that question for yourself.

2

trnka t1_iz4nfux wrote

Depends on the kind of model. Some examples:

  • For classification, a confusion matrix is a great way to find issues
  • For images of people, there's a good amount of work to detect and correct racial bias (probably there are tools to help too)
  • It can be helpful to use explainability tools like lime or shap (see the sketch after this list) -- sometimes that will help you figure out that the model is sensitive to some unimportant inputs and not sensitive enough to important features
  • Just reviewing errors or poor predictions on held-out data will help you spot some issues.
  • For time-series, even just looking at graphs of predictions vs actuals on held-out data can help you discover issues
  • For text input, plot metrics vs text length to see if it does much worse with short texts or long texts
  • For text input, you can try typos or different capitalization. If it's a language with accents, try inputs that don't have proper accents
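For the explainability point above, a hedged sketch with shap and a tree model; this assumes the shap and xgboost packages, and the dataset is just a stand-in for your own:

```python
# Rank features by mean absolute SHAP value -- a quick view of what the
# model is actually relying on.
import numpy as np
import shap
import xgboost
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
model = xgboost.XGBClassifier().fit(data.data, data.target)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data[:100])

importances = np.abs(shap_values).mean(axis=0)
for name, importance in sorted(zip(data.feature_names, importances),
                               key=lambda pair: -pair[1])[:5]:
    print(name, round(float(importance), 4))
```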

I wish I had some tool or service recommendations. I'm sure they exist, but the methods to use are generally specific to the input type of the model (text, image, tabular, time-series) and/or the output of the model (classification, regression, etc). I haven't seen a single tool or service that works for everything.

For hyperparameter tuning even libraries like scikit-learn are great for running it. At my last job I wrote some code to run statistical tests assuming that each hyperparam affected the metric independently and that helped a ton, then did various correlation plots. Generally it's good to check that you haven't made any big mistakes with hyperparams (like if the best value is the min or max of the ones you tried, you can probably try a wider range).
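A small sketch of that kind of check with sklearn's RandomizedSearchCV; the model and the parameter range are just examples:

```python
# Random search over C, then flag a best value sitting at the edge of the
# range -- a sign the search range should be widened.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

param_distributions = {"C": np.logspace(-3, 3, 20)}
search = RandomizedSearchCV(LogisticRegression(max_iter=1000),
                            param_distributions, n_iter=10, cv=5, random_state=0)
search.fit(X, y)

best_c = search.best_params_["C"]
c_values = param_distributions["C"]
if best_c in (c_values.min(), c_values.max()):
    print("Best C is at the edge of the range, widen the search:", best_c)
else:
    print("Best C:", best_c)
```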

Some of the other issues that come to mind in deployment:

  • We had one pytorch model that would occasionally have a latency spike (like <0.1% of the time). We never figured out why, except that the profiler said it was happening inside of pytorch.
  • We had some issues with unicode input -- the upstream service was sending us latin-1 but we thought it was utf8. We'd tested Chinese input and it didn't crash because the upstream just dropped those chars, but then it crashed with Spanish input.
  • At one point the model was using like 99% of the memory of the instance, and there must've been a memory leak somewhere because after 1-3 weeks it'd reboot. It was easy enough to increase memory though.
  • One time we had an issue where someone checked in a model that didn't match its evaluation report.
1

trnka t1_iz1nk60 wrote

Oh interesting paper - I haven't seen that paper before.

For what it's worth, I haven't observed double-descent personally, though I suppose I'd only notice it for sure with training time. We almost always had typical learning curves with epochs - training loss decreases smoothly as expected, and testing loss hits a bottom then starts climbing unless there's a TON of regularization.

We probably would've seen it with the number of model parameters cause we did random searches on those periodically and graphed the correlations. I only remember seeing one peak on those, though we generally didn't evaluate beyond 2x the number of params of our most recent best.

I probably wouldn't have observed the effect with more data because our distribution shifted over the years; for instance, in 2020 we got a lot more respiratory infections coming in due to COVID, which temporarily decreased our numbers and then increased them, because respiratory infections are easier to predict than other conditions.

2

trnka t1_iz0722k wrote

If possible, find some beta testers. If you're in industry try to find some non-technical folks internally. Don't tell them how to use it, just observe. That will often uncover types of inputs you might not have tested, and can become test cases.

Also, look into monitoring in production. Much like regular software engineering, it's hard to prevent all defects. But some defects are easy to observe by monitoring, like changes in the types of inputs you're seeing over time.

If you're relationship-oriented, definitely make friends with users if possible, or with the people who study user feedback and data, so that they pass feedback along more readily.

1

trnka t1_iz06hbj wrote

Multi-task learning has a long history with mixed results - sometimes very beneficial, and sometimes it just flops. At my previous job, we had one situation in which it was helpful and another situation in which it was harmful.

In the harmful situation, adding outputs and keeping the other layers the same led to slight reductions in quality at both tasks. I assume that it could've been salvaged if we'd increased the number of parameters -- I think the different outputs were effectively "competing" for hidden params.

Another way to look at this is that multi-task is effective regularization, so you can increase the number of parameters without as much risk of horrible overfitting. If I remember correctly there's research to show that overparameterized networks tend to get stuck in local minima less often.

One last story from the field -- in one of our multi-task learning situations, we found that it was easier to observe local minima by just checking per-output metrics. Two training runs might have the same aggregate metric, but one might be far better at output A and the other far better at output B.
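A minimal sketch of that kind of shared-trunk, multi-output setup, assuming PyTorch (the sizes, tasks, and losses are arbitrary):

```python
# Two heads share one trunk; with too few hidden units the heads end up
# "competing" for capacity, so per-task metrics are worth tracking.
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, n_features, hidden, n_classes_a, n_classes_b):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, n_classes_a)
        self.head_b = nn.Linear(hidden, n_classes_b)

    def forward(self, x):
        h = self.trunk(x)
        return self.head_a(h), self.head_b(h)

model = MultiTaskNet(n_features=32, hidden=64, n_classes_a=5, n_classes_b=3)
x = torch.randn(8, 32)
out_a, out_b = model(x)

# Training sums the per-task losses; reporting each task's metric separately
# makes it easier to spot runs that favor one output over the other.
loss = (nn.functional.cross_entropy(out_a, torch.randint(0, 5, (8,)))
        + nn.functional.cross_entropy(out_b, torch.randint(0, 3, (8,))))
loss.backward()
```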

2