Highlights: The essentials of Mr. Zhao Cheng's course on operation and maintenance system management

The four modules of our column are:

Application operation and maintenance system construction

  • Why doesn't Netflix have an operations job?

    A reasonable organizational structure is a necessary condition to ensure the implementation of the technical architecture. The fundamental solution is to use technical means to solve the efficiency and stability problems encountered in the operation and maintenance process.
  • In the era of microservice architecture, why should the construction of the operation and maintenance system take "application" as the core?

    Under a microservice architecture, our operation and maintenance perspective must take the application as its core concept, and everything must be analyzed from the application's point of view.
  • Standardization System Construction (Part 1): How to establish an application standardization system and model?

    Standards first, standards first, standards first. Abstracting standards and specifications out of complexity is the foundation for all the automation and stability assurance that follows.
  • Standardization System Construction (Part 2): How to standardize infrastructure and offer it as a service?

    What we need to do can be summarized in two steps: first, standardize the infrastructure; second, turn the infrastructure into services.
  • How to view the construction of application operation and maintenance system from the perspective of life cycle?

    Start from the life cycle: divide it into stages, distill the attributes, clarify the relationships, consolidate the basic information, and then implement the operation and maintenance scenarios.
  • The past and present of CMDB

    In the new era, for CMDB, a change in thinking is far more important than the technical implementation.
  • With CMDB, why do you need application configuration management?

    CMDB is resource-oriented management and the cornerstone of operation and maintenance. Application configuration management is application-oriented management and is the core of operation and maintenance.
  • How to implement the concept of application in CMDB?

    Operation and maintenance capability must be an expression of the overall technical architecture's capability. Viewing the two in isolation is meaningless.
  • How to create an operation and maintenance organization structure?

    Operation and maintenance work is like threading a string of pearls: it links the different departments of the platform technology organization, and even the development teams, so that the whole evolves toward exercising the operation and maintenance capability of the overall technical architecture.
  • Interpretation of Google SRE Operation and Maintenance Mode

    SRE is a job role, but it is also an operation and maintenance philosophy and methodology.
  • Starting from Google CRE, how does operation and maintenance cultivate service awareness?

    Whether we have a service mentality shows in how we do things, that is, whether we can take the other party's point of view when solving problems.

Best Practices for Efficiency and Stability

  • Continuous delivery is easier said than done. To get it right, you need to understand a few key points

    Configuration management, commit management, build, and deployment/release are the top priorities of continuous delivery. They form the critical path, the only route from code development to release.
  • The first key point of continuous delivery: configuration management

    Don't build a tower on shifting sand. When we build tool platforms or systems, we must pay attention to the foundation.
  • How to do a good job of multi-environment configuration management in continuous delivery?

    Environment configuration management mainly concerns configuring an application's dependencies on infrastructure and basic services.
  • Dev and test scrambling for environments? It's time for multi-environment construction

    In the offline environment area, we generally build three sets of environments: the integration test environment, the development test environment and the project environment.
  • Online environment construction must stand up to live-fire testing

    The pre-launch environment is like a pre-match warm-up. Players normally train on the practice ground, but before an official match they must go to the actual venue to get used to it or warm up in advance.
  • Strength in numbers vs. the two-pizza rule: the pipeline model in continuous delivery

    The principle for choosing a development mode: first look at the scenarios each mode suits, then look at our own actual usage scenarios.
  • Is it difficult to build continuous delivery pipeline software? What are the key issues?

    The efficiency and practicality of containers must be based on a more complete and highly standardized system, otherwise the tools will only become more and more chaotic.
  • Is the job done once the continuous delivery pipeline is built? Don't forget quality assurance

    To be clear, in the continuous delivery process, we still have to do a lot of work related to quality assurance, such as various functional and non-functional tests.
  • In continuous delivery, which matters more: concepts or scenarios? See how the "dumb way" finds the best solution

    The methods we adopted are really just dumb methods: find the problem, analyze it, research solutions, debate them, and then gradually explore and practice until we find the solution that suits us best.
  • In extreme scenarios, how should we ensure stability?

    For stability, the user access model is the key, and it is uncertain. Technology alone is not enough; we must dig into the business and understand it.
  • Stability Practice: Business Scenario Analysis of Capacity Planning

    Capacity planning is the process of rationally expanding and effectively planning resources through the analysis of complex business scenarios and through certain technical means.
  • Stability Practice: Building a stress-testing system for capacity planning

    Four dimensions of stress testing: stress-test granularity, how test interfaces and traffic are constructed, stress-test method, and data reads and writes.
  • Stability Practice: Rate Limiting and Degradation

    The difficulty and the key to rate limiting and degradation lie in unifying the overall technology stack and, later on, in accurately understanding and configuring the rate-limiting and degradation policies for each application's resources.
  • Stability Practice: Switches and Contingency Plans

    Switches are mainly used to turn a single function on or off, or to switch a function between different versions. A contingency plan can be understood as executing a complex procedure that puts an application or business into a specific state.
  • Stability Practice: End-to-end tracing system, the embodiment of technical operation capabilities

    When we build an end-to-end tracing system, the first problem to solve is locating faults quickly and accurately within complex service call relationships.
  • My understanding of failure

    A system running normally is just a special case among its countless abnormal states. A failure is only ever the surface symptom; the technical and management problems behind it are the root cause.
  • Fault management: fault classification and accountability

    We define five fault levels, P0 through P4: P0 is the highest and P4 the lowest.
  • Fault management: encourage getting things done rather than punishing mistakes

    For rules like this, I suggest setting "high-voltage lines" that the team always keeps in mind, as simple and clear as "don't drink and drive".
  • Fault management: emergency response and fault recovery

    Any contingency plan that has never been rehearsed is just empty talk. Put in the work in peacetime: build the tools and platforms, and consider and simulate as many failure scenarios as possible.
  • When the lips are gone, the teeth grow cold: O&M and security

    On cooperation between operation and maintenance and security, I have always believed that operation and maintenance should not merely respond passively. It should actively work with security to build a security system, integrate it with the operation and maintenance system, establish a solid line of defense, and control risks at the source.
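
The rate-limiting and degradation item in this module can be illustrated with a minimal sketch. It is not the course's implementation: the names TokenBucket and call_with_degradation, the token-bucket algorithm itself, and the injectable clock are all assumptions chosen for illustration. When the limiter rejects a request, the caller falls back to a cheaper, degraded response instead of failing.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical token-bucket limiter: refills `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the behaviour can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def call_with_degradation(limiter: TokenBucket,
                          primary: Callable[[], str],
                          fallback: Callable[[], str]) -> str:
    """Run `primary` if the limiter admits the request, otherwise the degraded `fallback`."""
    if limiter.allow():
        return primary()
    return fallback()
```

In practice the rate and capacity would be set per application and per resource, which is exactly the policy configuration the summary calls the hard part.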

Operation and Maintenance Practice in the Cloud Computing Era

  • Why did Mogujie choose to go to the cloud? A forced move or a proactive one?

    If you want technology to bring more possibilities to your business, embracing cloud computing is the best choice.
  • Why is hybrid cloud the mainstream form of cloud computing in the future?

    However we choose and use it, meeting the needs of the business scenario must remain the starting point. Without that, simply pursuing technical depth and complexity is meaningless.
  • Spring Cloud: Application Layer-Oriented Cloud Architecture Solution

    Spring Cloud is not only a microservice governance solution, but also a cloud architecture solution for the application layer.
  • Building on absolute advantages: the rise of the cloud ecosystem, seen through CDN and cloud storage

    Taking advantage of cloud computing and embracing changes can bring more possibilities to our business development and innovation.
  • Tailoring the optimal solution: static page architecture and building a second-level CDN

    Neither the public cloud nor cloud computing in general can give us a perfect customized solution. As the saying goes, analyze each problem on its own terms: identify the problem, optimize the path to a solution, and tailor it to fit. Only then do we get the "customized solution" that suits us best.
  • In the era of cloud computing, when we talk about elastic scaling, what exactly are we scaling?

    In operation and maintenance, we must accurately identify the different operation and maintenance objects in day-to-day work, then analyze the scenarios that correspond to each object, and finally break those scenarios down and develop for them.

Personal Growth

  • How did I get into an operations role?

    This career path was not something I deliberately designed, nor were the opportunities ones I deliberately fought for. I simply did a little more than usual and did it more carefully, made sure I delivered a result in the end, and worked hard to make that result somewhat better than expected.
  • In the era of cloud computing and AI, how should operations and maintenance be transformed?

    The fundamental solution is to continuously learn and improve one's own skills, maintain a keen awareness of technological development trends, and make timely adjustments and responses.
  • Does operation and maintenance need to understand products and operations?

    What we emphasize is that operation and maintenance must have product and operation awareness. To sum up, the two most essential points are: first, to be able to clarify the requirements; second, to be able to promote the product.
  • Thinking it over calmly: can employee resignations really be prevented?

    Technical managers must focus on people, not just things. This is the biggest difference between being a technical backbone and a technical manager, and it is also the first step in changing ideas.
  • Building personal brand awareness: the importance of professional reputation from background checks

    The background check process is beyond our control, but our own performance is always within our control.

