Tag Archives: Big Data Analytics

45 Techniques Used by Data Scientists

These techniques cover most of what data scientists and related practitioners are using in their daily activities, whether they use solutions offered by a vendor, or whether they design proprietary tools. When you click on any of the 45 links below, you will find a selection of articles related to the entry in question. Most of these articles are hard to find with a Google search, so in some ways this gives you access to the hidden literature on data science, machine learning, and statistical science. Many of these articles are fundamental to understanding the technique in question, and come with further references and source code.

Starred techniques (marked with a *) belong to what I call deep data science, a branch of data science that has little if any overlap with closely related fields such as machine learning, computer science, operations research, mathematics, or statistics. Even classical machine learning and statistical techniques such as clustering, density estimation, or tests of hypotheses have model-free, data-driven, robust versions designed for automated processing (as in machine-to-machine communications), and thus also belong to deep data science. However, these techniques are not starred here, as their standard versions are better known (and unfortunately more widely used) than their deep data science equivalents.

To learn more about deep data science, click here. Note that unlike deep learning, deep data science is not the intersection of data science and artificial intelligence; however, the analogy between deep data science and deep learning is not completely meaningless, in the sense that both deal with automation.

Also, to discover in which contexts and applications the 45 techniques below are used, I invite you to read the following articles:

Finally, when using a technique, you need to test its performance. Read this article about 11 Important Model Evaluation Techniques Everyone Should Know.

The 45 data science techniques

  1. Linear Regression
  2. Logistic Regression
  3. Jackknife Regression *
  4. Density Estimation
  5. Confidence Interval
  6. Test of Hypotheses
  7. Pattern Recognition
  8. Clustering – (aka Unsupervised Learning)
  9. Supervised Learning
  10. Time Series
  11. Decision Trees
  12. Random Numbers
  13. Monte-Carlo Simulation
  14. Bayesian Statistics
  15. Naive Bayes
  16. Principal Component Analysis – (PCA)
  17. Ensembles
  18. Neural Networks
  19. Support Vector Machine – (SVM)
  20. Nearest Neighbors – (k-NN)
  21. Feature Selection – (aka Variable Reduction)
  22. Indexation / Cataloguing *
  23. (Geo-) Spatial Modeling
  24. Recommendation Engine *
  25. Search Engine *
  26. Attribution Modeling *
  27. Collaborative Filtering *
  28. Rule System
  29. Linkage Analysis
  30. Association Rules
  31. Scoring Engine
  32. Segmentation
  33. Predictive Modeling
  34. Graphs
  35. Deep Learning
  36. Game Theory
  37. Imputation
  38. Survival Analysis
  39. Arbitrage
  40. Lift Modeling
  41. Yield Optimization
  42. Cross-Validation
  43. Model Fitting
  44. Relevancy Algorithm *
  45. Experimental Design
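
As a small illustration of how two entries in the list fit together, the sketch below fits a one-variable linear regression (item 1) and estimates its out-of-sample error with k-fold cross-validation (item 42). It is a minimal example on synthetic data, not taken from the linked articles; the function names `fit_line` and `k_fold_mse` and the data-generating line y = 2 + 3x are assumptions made for the example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def k_fold_mse(xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint index sets
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Synthetic data (illustrative only): y = 2 + 3x plus Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2 + 3 * x + rng.gauss(0, 0.5) for x in xs]

intercept, slope = fit_line(xs, ys)
cv_mse = k_fold_mse(xs, ys, k=5)
print(f"intercept={intercept:.2f} slope={slope:.2f} cv_mse={cv_mse:.3f}")
```

The cross-validated error is computed only on points the model did not see during fitting, which is why it is a more honest performance estimate than the error on the training data.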

Source: https://www.datasciencecentral.com/profiles/blogs/40-techniques-used-by-data-scientists

Most dangerous attack techniques, and what’s coming next 2018

RSA Conference 2018

Experts from SANS presented the five most dangerous new cyber attack techniques in their annual RSA Conference 2018 keynote session in San Francisco, and shared their views on how they work, how they can be stopped or at least slowed, and how businesses and consumers can prepare.

The five threats outlined are:

1. Repositories and cloud storage data leakage
2. Big Data analytics, de-anonymization, and correlation
3. Attackers monetize compromised systems using crypto coin miners
4. Recognition of hardware flaws
5. More malware and attacks disrupting ICS and utilities instead of seeking profit

Repositories and cloud storage data leakage

Ed Skoudis, lead for the SANS Penetration Testing Curriculum, talked about the data leakage threats facing us from the increased use of repositories and cloud storage:

“Software today is built in a very different way than it was 10 or even 5 years ago, with vast online code repositories for collaboration and cloud data storage hosting mission-critical applications. However, attackers are increasingly targeting these kinds of repositories and cloud storage infrastructures, looking for passwords, crypto keys, access tokens, and terabytes of sensitive data.”

He continued: “Defenders need to focus on data inventories, appointing a data curator for their organization and educating system architects and developers about how to secure data assets in the cloud. Additionally, the big cloud companies have each launched an AI service to help classify and defend data in their infrastructures. And finally, a variety of free tools are available that can help prevent and detect leakage of secrets through code repositories.”
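
To make the idea of detecting leaked secrets in code concrete, here is a minimal sketch of a pattern-based scanner. It is not one of the tools Skoudis refers to; the rule names, the `scan_text` helper, and the three regular expressions are assumptions chosen for illustration, and real scanners use far larger rule sets plus entropy checks. The sample key is the placeholder value used in AWS documentation.

```python
import re

# Illustrative patterns only; production scanners ship many more rules.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "hardcoded_password": re.compile(r"(?i)\bpassword\s*[:=]\s*['\"][^'\"]{4,}['\"]"),
}

def scan_text(text):
    """Return (line_number, rule_name) for every line that matches a rule."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings

# Hypothetical file content with two planted secrets.
sample = "\n".join([
    "db_host = 'localhost'",
    "password = 'hunter2-example'",
    "aws_key = 'AKIAIOSFODNN7EXAMPLE'",
])

findings = scan_text(sample)
for line_number, rule_name in findings:
    print(f"line {line_number}: possible {rule_name}")
```

A check like this is typically run before code is committed or pushed, so that a secret is caught before it reaches a shared repository.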

Big Data analytics, de-anonymization, and correlation

Skoudis went on to talk about the threat of Big Data Analytics and how attackers are using data from several sources to de-anonymise users:

“In the past, we battled attackers who were trying to get access to our machines to steal data for criminal use. Now the battle is shifting from hacking machines to hacking data — gathering data from disparate sources and fusing it together to de-anonymise users, find business weaknesses and opportunities, or otherwise undermine an organisation’s mission. We still need to prevent attackers from gaining shell on targets to steal data. However, defenders also need to start analysing risks associated with how their seemingly innocuous data can be combined with data from other sources to introduce business risk, all while carefully considering the privacy implications of their data and its potential to tarnish a brand or invite regulatory scrutiny.”
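
The data-fusion risk described above can be sketched as a simple linkage exercise: an "anonymized" dataset is joined to a public one on shared quasi-identifiers. Every record, field name, and helper below is invented for illustration and does not come from the talk; the point is only to show how a defender might measure the exposure of a dataset before releasing it.

```python
from collections import Counter

# All records are invented for illustration.
anonymized_visits = [
    {"zip": "94105", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"zip": "94105", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "10001", "birth_year": 1984, "sex": "F", "diagnosis": "migraine"},
]
public_profiles = [
    {"name": "Alice Example", "zip": "94105", "birth_year": 1984, "sex": "F"},
    {"name": "Bob Example", "zip": "94105", "birth_year": 1990, "sex": "M"},
    {"name": "Carol Example", "zip": "60601", "birth_year": 1975, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def key(record):
    """Tuple of quasi-identifier values used to join the two datasets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)

def link(anonymized, public):
    """Return (name, diagnosis) for rows matching exactly one public profile."""
    names_by_key = {}
    for profile in public:
        names_by_key.setdefault(key(profile), []).append(profile["name"])
    linked = []
    for row in anonymized:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:  # a unique match re-identifies the row
            linked.append((names[0], row["diagnosis"]))
    return linked

def smallest_group(records):
    """Size of the smallest quasi-identifier group (the k in k-anonymity)."""
    return min(Counter(key(r) for r in records).values())

reidentified = link(anonymized_visits, public_profiles)
print(reidentified)
print("smallest group size:", smallest_group(anonymized_visits))
```

A smallest group size of 1 means at least one row is unique on its quasi-identifiers, so a single matching record in any outside source is enough to attach a name to it.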

Attackers monetize compromised systems using crypto coin miners

Johannes Ullrich is Dean of Research at the SANS Institute and Director of the SANS Internet Storm Center. He has been looking at the increasing use of crypto coin miners by cyber criminals:

“Last year, we talked about how ransomware was used to sell data back to its owner and crypto-currencies were the tool of choice to pay the ransom. More recently, we have found that attackers are no longer bothering with data. Due to the flood of stolen data offered for sale, the value of most commonly stolen data like credit card numbers or PII has dropped significantly. Attackers are instead installing crypto coin miners. These attacks are more stealthy and less likely to be discovered, and attackers can earn tens of thousands of dollars a month from crypto coin miners. Defenders therefore need to learn to detect these coin miners and to identify the vulnerabilities that have been exploited in order to install them.”

Recognition of hardware flaws

Ullrich then went on to say that software developers often assume that hardware is flawless and that this is a dangerous assumption. He explained why and what needs to be done:

“Hardware is no less complex than software and mistakes have been made in developing hardware just as they are made by software developers. Patching hardware is a lot more difficult and often not possible without replacing entire systems or suffering significant performance penalties. Developers therefore need to learn to create software without relying on hardware to mitigate any security issues. Similar to the way in which software uses encryption on untrusted networks, software needs to authenticate and encrypt data within the system. Some emerging homomorphic encryption algorithms may allow developers to operate on encrypted data without having to decrypt it first.”
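
As a rough illustration of authenticating data that moves between components of the same system, the sketch below attaches an HMAC-SHA256 tag to each message and verifies it on receipt. It covers only the authentication half of Ullrich's point: the Python standard library has no authenticated cipher, so encryption is deliberately left out here. The `seal` and `open_sealed` helpers and the message format (tag followed by payload) are assumptions made for this example.

```python
import hashlib
import hmac
import secrets

TAG_LENGTH = 32  # bytes in an HMAC-SHA256 tag

def seal(key, payload):
    """Prefix the payload with an HMAC-SHA256 tag computed under key."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload

def open_sealed(key, message):
    """Verify the tag and return the payload; raise ValueError on mismatch."""
    tag, payload = message[:TAG_LENGTH], message[TAG_LENGTH:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position through timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    return payload

# Hypothetical shared key and message between two internal components.
key = secrets.token_bytes(32)
message = seal(key, b"sensor reading: 42")
print(open_sealed(key, message))
```

The receiving component rejects any message whose tag does not verify, so data altered in transit inside the system is not silently trusted.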

More malware and attacks disrupting ICS and utilities instead of seeking profit

Finally, Head of R&D, SANS Institute, James Lyne, discussed the growing trend in malware and attacks that aren’t profit centred as we have largely seen in the past, but instead are focused on disrupting Industrial Control Systems (ICS) and utilities:

“Day to day the grand majority of malicious code has undeniably been focused on fraud and profit. Yet, with the relentless deployment of technology in our societies, the opportunity for political or even military influence only grows. And rare publicly visible attacks like Triton/Trisis show the capability and intent of those who seek to compromise some of the highest risk components of industrial environments, i.e. the safety systems which have historically prevented critical security and safety meltdowns.”

He continued: “ICS systems are relatively immature and easy to exploit in comparison to the mainstream computing world. Many ICS systems lack the mitigations of modern operating systems and applications. The reliance on obscurity or isolation (both increasingly untrue) does not position them well to withstand a heightened focus on them, and we need to address this as an industry. More worrying is that attackers have demonstrated they have the inclination and resources to diversify their attacks, targeting the sensors that are used to provide data to the industrial controllers themselves. The next few years are likely to see some painful lessons being learned as this attack domain grows, since the mitigations are inconsistent and quite embryonic.”

Source: https://www.helpnetsecurity.com/2018/04/23/dangerous-attack-techniques/