
03.07.2025
Learn how to fine-tune Temperature, Top-p, Max Tokens, Frequency Penalty, and Search Limit settings to achieve creative, coherent, and cost-efficient outcomes in LLM-powered chatbot, RAG, and content generation projects. Discover parameter optimization strategies with SkyStudio-based examples.
Today, Large Language Models (LLMs) and AI-powered agents built upon them are at the forefront of groundbreaking advancements in the field of Natural Language Processing (NLP). The success of these powerful tools largely depends on the precise tuning of model parameters and search configurations. By managing these parameters, developers and users can effectively control the creativity, length, diversity of responses, and the efficiency of retrieval processes in LLM agents.
Maximizing the performance of LLMs requires more than just the strength of the model—it also depends on fine-tuning aligned with specific user and application needs. This is especially critical in advanced applications such as Retrieval-Augmented Generation (RAG) systems and conversational agents, where optimal parameter configuration plays a pivotal role. In this context, it is essential to understand the mathematical foundations of key hyperparameters like Temperature, Top-p, and Frequency Penalty, as well as to accurately interpret the interactions between them.
The Temperature parameter determines the level of randomness applied by large language models during text generation. Technically, this parameter acts as a scaling factor on the logit values used in the softmax function and is denoted by the symbol τ (tau).
As the τ (temperature) value decreases, the model tends to produce more precise and deterministic outputs. Conversely, as the τ value increases, the model’s prediction probabilities become more evenly distributed, allowing lower-probability words a greater chance of being selected.
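To make this concrete, here is a minimal Python sketch of a temperature-scaled softmax over a few made-up logit values (the numbers are illustrative only):

```python
import math

def softmax_with_temperature(logits, tau):
    """Scale logits by 1/tau, then apply softmax."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]  # hypothetical scores for three candidate tokens

print(softmax_with_temperature(logits, tau=0.2))  # sharply peaked: the top token dominates
print(softmax_with_temperature(logits, tau=1.0))  # the original, unscaled distribution
print(softmax_with_temperature(logits, tau=1.8))  # flatter: low-probability tokens gain share
```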
At low temperature values, the model behaves very deterministically. In such cases, it typically selects the most probable words, producing consistent and predictable responses. This makes low temperature settings ideal for tasks that require high accuracy, such as technical writing, code completion, or solving mathematical problems.
High temperature values encourage the model to make more creative and unexpected choices. These values flatten the probability distribution, increasing the likelihood of selecting less probable words. As a result, the generated text becomes more diverse and innovative—though potentially less logically consistent. High temperature settings are well-suited for creative writing, poetry generation, or brainstorming activities.
Practical impacts of temperature settings:
Very Low Temperature: With τ close to 0, the model effectively performs greedy decoding, selecting the most likely token at each step. For instance, if prompted with "It's very hot, so I…", it will almost always respond with something like "turned on the air conditioner," assigning that continuation a probability of around 98%. This mode is ideal for customer service chatbots, where consistency and reliability are crucial.

Medium Temperature: Around τ = 1.0, the default setting for many modern models, the original probability distribution is preserved. This strikes a balance, producing coherent yet not overly predictable responses. Medium temperature suits many everyday use cases, offering both fluency and consistency.
High Temperature: At this setting, the model becomes much more experimental, creative, and occasionally surprising. According to a 2025 study, approximately 40% of poems generated with a τ=1.8 setting were indistinguishable from those written by humans.
High temperature settings are especially favored in creative writing, storytelling, and art-focused projects.
The Max Tokens parameter is used to define the maximum length of a response generated by large language models. When processing and generating text, language models break down input and output into smaller units called tokens. The Max Tokens value sets an upper limit on how many tokens the model can use in a single response. The model is not required to use the full limit—it may produce shorter responses—but it will never exceed the specified maximum.
Controlling the Max Tokens parameter is important for several reasons: reducing API usage costs, increasing response speed, and tailoring responses to the level of detail required by the user. From a user experience perspective, it also helps prevent excessively long answers, thereby improving readability and usability.
Setting the Max Tokens value too low may cause the model to cut off its responses prematurely. For example, it might provide only a brief and superficial answer to a complex question. This can be a serious limitation in tasks that require elaboration, summarization, or in-depth analysis, where the model may be forced to end its output before delivering a complete and informative response.
On the other hand, setting the Max Tokens value too high can also create issues. Higher token limits enable the model to deliver more comprehensive and detailed answers, which is useful for tasks like technical documentation, storytelling, or in-depth analysis. Excessively high values, however, may lead the model to produce unnecessarily long responses, stray off-topic, or repeat content, filling the extra space with verbose or even hallucinated material.
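As a quick illustration, this is roughly how Max Tokens and Temperature are set in an OpenAI-style chat completion call. The client library, model name, and exact parameter names depend on the provider you use, so treat this as a sketch rather than a definitive recipe:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name; use whichever model your provider offers
    messages=[{"role": "user", "content": "Summarize the key points of our refund policy."}],
    max_tokens=150,       # hard upper bound on the length of the reply
    temperature=0.3,      # low temperature for a factual, consistent answer
)
print(response.choices[0].message.content)
```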
Top-p, also known as nucleus sampling, is a selection method that allows large language models to dynamically determine which tokens can be used during word generation. It restricts the model’s probability distribution based on a cumulative threshold value (p). Simply put, the Top-p parameter ensures that the model selects tokens starting from the highest probability, continuing until the total cumulative probability reaches the specified p value.
For example, when Top-p = 0.9 is set, the model considers all tokens whose combined probabilities add up to 90%, starting from the most probable word. The remaining, lower-probability tokens are completely excluded. This enables the model to choose from a dynamic token set that varies based on the distribution, rather than a fixed number of candidates.
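A simplified sketch of nucleus sampling over a toy distribution (real implementations operate on the full vocabulary, but the selection logic is the same):

```python
import random

def nucleus_sample(token_probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p, then sample from it."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # everything below this point is excluded
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"dragon": 0.45, "spirit": 0.25, "friend": 0.15, "toy": 0.10, "quasar": 0.05}
print(nucleus_sample(probs, p=0.9))   # candidate pool: dragon, spirit, friend, toy
print(nucleus_sample(probs, p=0.5))   # candidate pool: dragon, spirit only
```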
A low Top-p value causes the model to select from only a very limited set of high-probability words. This results in more consistent and predictable outputs, though the text may appear more constrained or formulaic. Such settings are preferred in technical writing, critical instructions, or scenarios involving sensitive information, where the model is expected to produce safe and expected word choices.
A high Top-p value allows the model to consider a broader set of tokens, enabling more diverse, creative, and unexpected word choices. These settings are especially useful in tasks such as story writing, creative content generation, or open-ended brainstorming. However, as the model is given more freedom, the likelihood of off-topic or irrelevant word choices also increases.
Low Top-p: In a storytelling scenario, when asked to complete the sentence, “The wizard opened the door and saw…”, the model is likely to choose a predictable, cliché ending—such as “a dragon” or “an evil spirit.” While this reduces creativity, it provides consistency and reliability.
Medium-to-High Top-p: In the same example, a broader range of choices becomes available to the model, allowing it to generate more imaginative and surprising completions—such as “his long-lost friend” or “a forgotten childhood toy.” This enhances the narrative’s richness and originality.
For most applications, a Top-p range between 0.8 and 0.95 offers a balanced trade-off between diversity and coherence. A value of 1.0 exposes the model to the entire probability distribution, unlocking its full creative potential—but also increasing the risk of generating unexpected or off-topic outputs. The best practice is to experiment with different values in your specific use case to find the ideal Top-p setting.
Presence penalty is a penalization mechanism that activates when a word or token appears at least once in the model’s output. This penalty does not depend on how many times the word is repeated, but solely on whether it has already been used. In other words, even if a word is used only once, the model’s likelihood of selecting it again is automatically reduced. Technically, this is implemented by subtracting a fixed coefficient from the token’s logit value.
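A rough sketch of that logit adjustment (the exact formula and coefficient vary by provider; the tokens and values below are made up):

```python
def apply_presence_penalty(logits, generated_tokens, presence_penalty=0.8):
    """Subtract a fixed amount from the logit of every token that has already appeared."""
    seen = set(generated_tokens)
    return {tok: (logit - presence_penalty if tok in seen else logit)
            for tok, logit in logits.items()}

logits = {"cat": 3.1, "feline": 2.4, "animal": 2.2}   # hypothetical next-token logits
print(apply_presence_penalty(logits, generated_tokens=["the", "cat", "sat"]))
# "cat" drops from 3.1 to 2.3, making alternatives such as "feline" relatively more likely
```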
This parameter is particularly useful for preventing unconscious repetition of the same words or specific terms within a generated text. For example, in a creative writing task, repeatedly using the word “magical” in every sentence may fatigue the reader and make the content monotonous. In such cases, the presence penalty encourages the model to vary its vocabulary and use alternative expressions.
Increasing the presence penalty helps the model avoid repetitive phrasing. Instead of reusing a word it has already used, the model is more likely to choose a synonym, a related term, or an alternative phrasing. This enhances textual richness, especially in scenarios where diversity is desired.
On the other hand, when the presence penalty is set close to zero, the model behaves more freely and may repeat the same word multiple times. This can be beneficial in technical documents or scenarios where intentional repetition of a term is important. For instance, in a product manual, consistently repeating the term “data security” may be necessary for clarity and coherence.
High Presence Penalty: Let’s say a user asks the model to “describe a cat.” After using the word “cat” once, the model avoids repeating it and instead uses expressions like “this adorable animal,” “feline,” or “tiny companion.” It may even expand the narrative by describing the cat’s habitat, behaviors, and more. The result is a richer, non-repetitive piece of text.
Low Presence Penalty: In the same scenario, the model might say:
“This cat is a very cute cat because the cat snuggled next to its cat mother.”
Such repetition can reduce the quality of creative content, though it may be intentionally used in technical documents or poetic structures.
While frequency penalty is more effective in reducing repetition in technical writing, presence penalty tends to provide stronger variation in creative writing.
Frequency penalty is a penalization mechanism that reduces the probability of a word or phrase being selected again based on how frequently it has already been used by the large language model. Technically, the more a token is repeated, the lower its probability of being chosen again in future steps.
Unlike the previously mentioned presence penalty, which penalizes the mere presence of a word regardless of its repetition count, frequency penalty specifically considers how many times a word has been repeated. In other words, the more a word is repeated, the greater the penalty applied to its next occurrence.
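Putting the two penalties side by side, the following sketch mirrors the commonly documented adjustment used by OpenAI-style APIs; the token logits, history, and coefficients are illustrative:

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, presence_penalty=0.5, frequency_penalty=0.5):
    """Presence: one-time penalty if the token appeared at all; frequency: scales with its count."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for tok, logit in logits.items():
        c = counts.get(tok, 0)
        adjusted[tok] = logit - presence_penalty * (1 if c > 0 else 0) - frequency_penalty * c
    return adjusted

logits = {"milk": 2.8, "cheese": 2.5, "bread": 2.4}
history = ["milk", "bread", "milk", "milk"]   # "milk" has already been used three times
print(apply_penalties(logits, history))
# milk:   2.8 - 0.5 - 0.5 * 3 = 0.8  (heavily penalized)
# bread:  2.4 - 0.5 - 0.5 * 1 = 1.4  (mildly penalized)
# cheese: 2.5                        (untouched)
```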
A high frequency penalty helps prevent the model from repeating the same words over and over again. This promotes a richer vocabulary throughout the text and increases readability by reducing redundancy. It is particularly useful for avoiding immediate repetitions (such as “very very nice”) or meaningless loops. On the other hand, if the frequency penalty is set to a low value or zero, the model does not penalize repetitions. This can be desirable in certain scenarios. For example, in technical documents where consistent emphasis on specific terms is required, or in poetic and artistic texts where deliberate repetition is used, low values are preferred.
High Frequency Penalty:
When we ask the model to create a shopping list, using a high frequency penalty prevents it from repeating the same words. For example: “1. Bread, 2. Milk, 3. Cheese, 4. Eggs,” instead of repeatedly listing “milk.” Similarly, when writing a short poem, a high frequency penalty encourages the model to use synonyms or different words instead of repeating the same word over and over, resulting in a richer and more diverse text.
Low Frequency Penalty:
In the same shopping list example, with a low frequency penalty, the model is more permissive with repetition. In this case, the list might include: “1. Bread, 2. Milk, 3. Milk, 4. Eggs…” including repeated items. Additionally, in poems or slogans where emphasis through repetition is needed, low values are chosen. For example: “for you, for your sake, for your dreams…” deliberate repetitions can be preserved.
Frequency penalty is often used together with presence penalty, and these two parameters can be adjusted jointly to achieve an optimal level of repetition in the text. Balancing these parameters correctly ensures the text remains both fluent and readable while clearly delivering its message. Therefore, experimenting with different values is essential to find the most suitable setting for your specific content.
The Convert Numbers to Text parameter is a setting that enables large language models to express numbers in written words while generating or processing text. When this feature is activated, the model outputs numbers not as digits but as words. For example, the number “2025” would be converted to “two thousand twenty-five.”
The main purpose of this parameter is to improve consistency between digits and text in documents, prevent potential formatting confusion, and help the model better understand the semantic meaning of numbers. Therefore, it is especially preferred in text-to-speech (TTS) applications, official documents, contracts, or other natural language scenarios where readability is critical.
When the Convert Numbers to Text feature is enabled, model outputs gain a smoother, more natural, and consistent appearance. For example, instead of “3 apples,” the expression “three apples” would be used when the feature is on. This significantly improves user experience in scenarios where the entire text will be read as continuous prose or converted into speech.
When the parameter is disabled, numbers are left as digits. This approach is preferred in technical texts, software manuals, financial reports, or any situation involving numerical calculations, as digits make the text clearer and easier to follow.
For texts intended to be read by humans (e.g., contracts, voice assistant responses, articles, content production): it is recommended to keep this parameter active.
In technical documents, software manuals, financial reports, or contexts where precise numeric values are critical: it is recommended to keep this parameter deactivated.
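If your platform does not expose such a toggle directly, a similar effect can be approximated with a small post-processing step. Here is a sketch that assumes the third-party num2words package is installed; it is an illustration, not a built-in SkyStudio feature:

```python
import re
from num2words import num2words

def spell_out_numbers(text, lang="en"):
    """Replace standalone integers in a model's output with their written-out form."""
    return re.sub(r"\b\d+\b", lambda m: num2words(int(m.group()), lang=lang), text)

print(spell_out_numbers("The contract covers 3 apples and expires in 2025."))
# Numbers are spelled out: "3" becomes "three", "2025" becomes its written form
# (wording follows num2words' default English style).
```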
Top-k restricts the number of options a large language model considers when selecting the next word during text generation to a fixed value k. At each step, the model sorts all possible tokens by probability from highest to lowest and includes only the top k tokens in the candidate pool. All other tokens beyond these top k options are ignored. In this way, the set of selectable options is narrowed, resulting in more controlled and consistent outputs.
Technically, the Top-k sampling method offers a balance point between greedy search and completely random selection. When k = 1, the model acts entirely deterministically; it always selects the highest-probability word and ignores all alternatives. As the value of k increases, a wider choice set is created, which introduces more randomness and diversity.
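A toy sketch of the mechanism, analogous to the Top-p example above but with a fixed pool size:

```python
import random

def top_k_sample(token_probs, k):
    """Keep only the k highest-probability tokens, then sample from them by weight."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*ranked)
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"fusion": 0.50, "hydrogen": 0.20, "gravity": 0.15, "magic": 0.10, "cheese": 0.05}
print(top_k_sample(probs, k=1))   # always "fusion": greedy and fully deterministic
print(top_k_sample(probs, k=3))   # one of fusion / hydrogen / gravity, weighted by probability
```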
Keeping the Top-k value low forces the model to choose words from a small, narrow pool. This can improve text consistency because the model only selects from the highest-probability words. However, if k is too small, creativity can be limited, leading to repetitive or overly similar expressions.
Raising the Top-k value broadens the range of options the model considers when choosing words. This enables more diverse, creative, and unexpected word choices. However, selecting from a larger pool can sometimes increase the chance of irrelevant or out-of-context words being chosen.
Low Top-k Value:
When a user asks, “Why does the Sun shine?” with k = 1, the model will always return the same highest-probability answer: “The Sun shines because of nuclear fusion.” This is particularly preferred in technical documents, legal texts, mathematical calculations, or tasks where precise and consistent answers are needed.
High Top-k Value:
For the same question, with a larger pool (say, k = 50), the model may choose from up to 50 token options at each step and thus generate slightly different expressions each time. For example: "The Sun radiates energy produced by hydrogen fusion in its core, which we see as light," or "The Sun's brightness is a result of atoms fusing in its center, generating vast amounts of energy." In this way, the same core information is preserved while the phrasing varies.
We can also observe the effect of Top-k in story completion. For example, when continuing “The hero entered the forest and…”, a low k value will typically lead to predictable, common endings. With a high k value, the model might continue with unexpected, creative options such as “discovered a lost ancient city.”
Low Top-k: Preferred in technical documents, legal texts, mathematical calculations, or tasks requiring precise answers.
Medium and High Top-k: Produce better results in creative content, stories, chatbots, or scenarios that demand variety and creativity in responses.
Typically, the Top-k parameter is used together with the Top-p parameter to achieve both consistency and dynamic, creative outputs. Balancing these two parameters allows the model to hit the ideal point between excessive predictability and excessive randomness.
Merge chunking is a specialized text pre-processing method used in the information retrieval processes of LLM-based agents. Designed especially for handling long documents and extensive texts, the technique consists of two fundamental steps:
In the first stage, long documents are divided into smaller text segments (chunks) according to a specific strategy. This splitting process is typically performed as fixed-length paragraphs or groups of sentences.
In the second stage, these small text segments are merged again based on their semantic similarity. The goal is to restore the contextual integrity that might be lost during splitting and to present related information as larger, more meaningful blocks in search results. In this way, while improving text processability, information integrity is also preserved.
When merge chunking is enabled, the model uses the divided text segments as larger and semantically coherent blocks. This makes it easier for the agent to access and understand the information. For example, imagine splitting a document into 100-word segments. If an important explanation is split across two different chunks, the merge chunking method can detect that these two pieces are semantically related and rejoin them. This allows the model to see the entire piece of information at once and provide a more accurate and comprehensive answer.
If this feature is disabled, the model has to work with only the fixed-length small chunks, which can cause context fragmentation. In cases where a heading and its explanation end up in different chunks, the model would evaluate them separately, resulting in incomplete or fragmented content.
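A simplified sketch of the two stages: fixed-size splitting followed by merging neighbouring chunks whose embeddings are similar. The embed argument is a placeholder for whatever embedding model your stack provides, and the threshold is illustrative:

```python
def split_into_chunks(text, chunk_size=100):
    """Stage 1: naive fixed-length splitting by word count."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def merge_chunks(chunks, embed, threshold=0.8):
    """Stage 2: merge neighbouring chunks whose semantic similarity exceeds the threshold."""
    if not chunks:
        return []
    merged = [chunks[0]]
    for chunk in chunks[1:]:
        if cosine(embed(merged[-1]), embed(chunk)) >= threshold:
            merged[-1] = merged[-1] + " " + chunk   # semantically related: keep them together
        else:
            merged.append(chunk)                    # unrelated: start a new block
    return merged
```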
In long documents where contextual integrity is critical (legal documents, definitions, stories, etc.), merge chunking is beneficial. In these texts, preserving context is important, and re-merging small chunks to maintain semantic integrity improves the quality of results.
When each piece is already independent and meaningful (for example, frequently asked questions (FAQs), short bullet-point explanations), there is no need for merge chunking. Unnecessarily merging semantically weak pieces could negatively affect system performance.
As a result, merge chunking is a powerful technique that, when applied correctly, enables large documents to be processed efficiently, preserves semantic integrity, and improves the quality of search results. However, for effective use, it is essential to ensure that the merged pieces are genuinely semantically related.
Reranking is the process by which LLM-based agents re-evaluate information retrieval or search results in a second stage, making them more accurate and relevant. When this parameter is enabled, the agent does not directly use the retrieved results as they are but instead passes them through an additional evaluation process, reprioritizing the pieces that best match the query. If disabled, the agent uses the results in their initially retrieved order.
In the reranking stage, reranking models semantically align the initially retrieved results with the query at a deeper level and move the best-matching content to the top positions. This process is especially critical in large and heterogeneous knowledge bases.
Activating reranking can significantly improve the accuracy of an agent’s search and retrieval performance. The initial search result does not always surface the most relevant content for the query, since the first ranking often relies on superficial criteria (such as keywords or simple similarity scores). In such cases, truly relevant content may remain buried in lower ranks. Reranking re-evaluates these pieces and moves the most contextually appropriate content to the top. This ensures the model operates with the correct context and information.
When this parameter is disabled, the agent has to work only with the initial ranking and may be forced to generate answers from potentially less relevant content. This can reduce the quality of the model’s responses or result in incomplete information.
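To show what a second-stage scorer can look like, here is a sketch using the sentence-transformers CrossEncoder class with a publicly available MS MARCO reranking model. Any cross-encoder reranker would follow the same pattern, and the query and documents are made up:

```python
from sentence_transformers import CrossEncoder

# A small public reranking model; swap in whichever reranker your stack uses.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I reset my password?"
candidates = [
    "To reset your password, open Settings and choose 'Forgot password'.",
    "Our office is open Monday to Friday, 9 to 5.",
    "Password resets require access to the e-mail address on the account.",
]

# Score every (query, document) pair, then reorder by relevance.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)

for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
```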
Reranking is especially beneficial when working with large and heterogeneous data sets. In such scenarios, it is difficult to ensure the quality of the initial search results, making reranking a critical step that can greatly improve answer quality, particularly in answer-focused question-answering systems.
However, if your data set is small and the first search step is already highly accurate, the additional benefit of reranking may be limited. In this case, considering the extra cost and time overhead of additional steps, reranking can be disabled.
When high accuracy and reliability are required (e.g., academic research, legal queries, medical content): it is recommended to enable reranking.
When fast access to sufficiently good initial results is needed (e.g., small-scale knowledge bases, FAQs): keeping reranking disabled may be more practical.
Search Limit is a parameter that determines the maximum number of results an LLM-based agent can retrieve during a search operation to answer a query. It caps how many results the agent will pull from external sources (such as web searches, API calls, or vector-based document queries). Search Limit directly affects the system's retrieval recall: fetching more results increases the likelihood of finding relevant information, but it also brings in many potentially unnecessary results. Therefore, correctly setting this parameter is highly important for the agent's overall performance.
Keeping the Search Limit value high allows the agent to obtain a broader set of results during its information retrieval process. This is advantageous especially in large and complex knowledge bases or web searches, where the information you need might not be among the top few results. In such cases, a high Search Limit helps capture critical information even from deeper-ranked documents. However, this also increases the chance of including irrelevant and excessive content among the retrieved results. As a result, these may need to be further filtered or reranked afterward. Additionally, since LLMs have a limited context window, feeding too much unnecessary data into the model can negatively impact performance.
With a low Search Limit value, the agent examines only the top few likely results. While this improves search speed and focus, it also increases the risk of missing the correct answer if it isn’t among the initial results. In other words, a low Search Limit provides higher precision but lower recall, while a high Search Limit provides higher recall but lower precision.
The ideal approach is to keep the Search Limit moderately high to increase coverage, and then use reranking or filtering methods to select only the most relevant few results to pass to the LLM. This maximizes coverage while minimizing the model's context load.
A low Search Limit can be preferred in scenarios where fast responses and low costs are required; however, it is important to consider the risk of the correct answer not being among the first few results.
A very high Search Limit may be useful in large knowledge bases (e.g., wiki pages, multi-page document collections) or in cases where queries are ambiguous. However, if used without an effective filtering or reranking step, it can overload the LLM with too much irrelevant data.
Ultimately, depending on the size of your data set, query types, and the LLM’s context capacity, it is best to adjust and test this parameter to find the optimal point.
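To make the trade-off concrete, here is a toy retriever in which the limit parameter plays the role of Search Limit; a real system would score by vector similarity rather than word overlap:

```python
def search(corpus, query, limit=5):
    """Score documents by naive word overlap with the query and return at most `limit` of them."""
    query_words = set(query.lower().split())
    scored = [(doc, len(query_words & set(doc.lower().split()))) for doc in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

corpus = [
    "Refund requests must be submitted within 30 days of purchase.",
    "Our refund policy does not cover digital downloads.",
    "Shipping usually takes 3 to 5 business days.",
    "Contact support to check the status of a refund request.",
]

print(search(corpus, "refund policy", limit=2))   # fast and focused, but may miss relevant items
print(search(corpus, "refund policy", limit=4))   # broader recall, more noise to filter later
```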
Search threshold, or the searchThreshold parameter, is a similarity score that determines how relevant the results returned for a query need to be. It typically takes a value between 0 and 1 and is especially used in vector-based search systems. For example, when the searchThreshold is set to 0.8, only results with a similarity score of 0.8 or higher are considered, and matches below this score are filtered out. This threshold prevents low-quality or irrelevant content from being passed to the LLM, ensuring that the model only works with information truly related to the topic. In this way, the accuracy and contextual consistency of the outputs are increased.
High threshold value (0.8+): Only content with very high similarity scores is sent to the model. As a result, content with little relevance to the topic is eliminated, reducing the likelihood of the model encountering misleading information. However, if the threshold is set too high, no content might meet the requirement, and the model may respond with something like “no information found on this topic.”
Low threshold value (0.4–0.6): More results are returned, providing broader coverage (high recall). However, the likelihood of including irrelevant or weakly related results in the model’s context also increases. In this case, the model may focus on unnecessary information, make incorrect inferences, or generate unsafe answers. Especially when the data sources include low-quality content, this can lead to hallucinations.
In conclusion, the searchThreshold value is a critical parameter for balancing search quality and coverage. It is recommended to adjust it according to your dataset, user expectations, and the system’s tolerance for errors.
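In code, the threshold is simply a filter over the similarity scores returned by the search step; the documents and scores below are invented for illustration:

```python
def filter_by_threshold(results, threshold=0.8):
    """Keep only hits whose similarity score meets the threshold."""
    return [(doc, score) for doc, score in results if score >= threshold]

results = [
    ("Refund policy: full refund within 30 days.", 0.91),
    ("Refunds for digital goods are not available.", 0.84),
    ("Shipping times vary by region.", 0.42),
]

print(filter_by_threshold(results, threshold=0.8))   # keeps the first two hits
print(filter_by_threshold(results, threshold=0.95))  # too strict: nothing passes, so the agent has no context
```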
Top N is the parameter that determines the number of results to be included in the LLM agent's context after search or reranking operations. First, a broad pool of results is created using searchLimit, then the most relevant content is promoted to the top via reranking. Top N controls how many of these candidates are actually passed to the model. This number is usually set based on the model's context window capacity (e.g., 4K, 8K, or 32K token limits) and the depth of information the task requires. Here, "N" refers to the number of re-ranked and filtered items that actually reach the model.
Low Top N value:
The model only accesses a few highly relevant pieces. This means less information clutter and clearer answers. However, if one of these pieces is missing or misleading, the model’s response may also be limited.
High Top N value:
The model has the chance to generate answers from a larger set of content. This is especially useful for multi-angle analyses or comparative answers. However, if the information from different sources is conflicting, it can become harder for the model to focus, and there is a higher risk of including irrelevant information.
For accuracy-focused, single-answer questions, Top N should be kept low.
For summarization, comparison, or questions requiring multi-perspective analysis, a higher Top N value can be used. The LLM’s context limit must always be considered. For example, if you provide 10 separate documents to a model with a 4096-token limit, some documents may be processed only partially, or the model may not be able to devote enough attention to each document.
In a typical RAG (Retrieval-Augmented Generation) scenario, the following pipeline is applied:
A broad search is performed → A large set of candidate content is collected using searchLimit.
Ranking is done → The most relevant content is promoted to the top via reranking.
Selection is made → The best few pieces of content are sent to the LLM using Top N.
This structure preserves high retrieval recall through large-scale data scanning, while limiting the volume of content the LLM works with, thus improving both quality and speed.
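Sketched in code, the pipeline might look like the function below, where retrieve, rerank, and generate stand in for whatever search backend, reranking model, and LLM client you actually use:

```python
def rag_pipeline(query, retrieve, rerank, generate, search_limit=20, top_n=3):
    """Minimal RAG flow: broad retrieval -> reranking -> pass only the best few chunks to the LLM."""
    # 1. Broad search: collect a large candidate pool (searchLimit).
    candidates = retrieve(query, limit=search_limit)

    # 2. Reranking: score each candidate against the query and sort by relevance.
    scored = sorted(((doc, rerank(query, doc)) for doc in candidates),
                    key=lambda pair: pair[1], reverse=True)

    # 3. Selection: keep only the Top N documents to respect the context window.
    context = [doc for doc, _ in scored[:top_n]]

    prompt = ("Answer using only the context below.\n\n"
              + "\n---\n".join(context)
              + f"\n\nQuestion: {query}")
    return generate(prompt)
```

Keeping the three steps separate like this also makes it easy to tune search_limit and top_n independently.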
Rerank threshold is a cutoff value that determines which results will be filtered out or included in the LLM context based on the relevance score calculated by a reranking model for each document. These scores are usually normalized between 0 and 1, and a value closer to 1 indicates that the document has a very strong connection to the query. Rerank threshold acts as a cut-off point on these scores, allowing only documents deemed sufficiently relevant to be included in the context. In this way, the model works only with truly meaningful content when generating responses.
High rerank threshold:
The model processes only high-scoring, directly relevant documents. This reduces unnecessary information load and increases the model’s focus. However, if the threshold is too strict, valuable documents may be excluded due to small score differences. For example, if a document is left out because it scored 0.78, the model may provide an incomplete response.
Low rerank threshold:
In this case, almost all documents are included in the model’s context. This can allow erroneous or irrelevant content to slip in. The LLM then has to parse through this text, which may lower response quality or lead to biased/incorrect inferences.
The rerank threshold value should be carefully chosen based on the system’s intended use and quality expectations. The best approach is to test with different queries, analyze the resulting score distributions, and set a meaningful cutoff point accordingly. Considering this distribution, a moderately high threshold is a good starting point in most scenarios. In systems where critical accuracy is required (such as medical or legal applications), the threshold should be kept high. If information coverage is also important, the threshold can be lowered slightly and then combined with limiting the number of documents.
Rerank threshold and Top N parameters usually work together:
For example, if rerank threshold = 0.6 and Top N = 5: First, documents with scores below 0.6 are filtered out.
The remaining documents are sorted by their scores.
The top 5 highest-scoring documents are sent to the model.
This approach both filters out irrelevant content and transfers the most meaningful documents to the LLM, allowing it to produce more accurate and focused responses.
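The same walk-through expressed in code, using the illustrative values from the example above (the document names and scores are made up):

```python
RERANK_THRESHOLD = 0.6
TOP_N = 5

scored_docs = [("doc_a", 0.92), ("doc_b", 0.81), ("doc_c", 0.74), ("doc_d", 0.63),
               ("doc_e", 0.61), ("doc_f", 0.58), ("doc_g", 0.41)]

# 1. Drop everything below the rerank threshold.
passed = [(doc, score) for doc, score in scored_docs if score >= RERANK_THRESHOLD]

# 2. Sort the survivors by score, then 3. keep only the Top N for the model's context.
context_docs = sorted(passed, key=lambda pair: pair[1], reverse=True)[:TOP_N]

print(context_docs)   # doc_a through doc_e make the cut; doc_f (0.58) and doc_g (0.41) are filtered out
```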
Balancing Creativity and Consistency
The Temperature, Top-p, and Top-k parameters together determine the level of creativity in the model’s output. If you want highly creative and diverse texts, it is beneficial to increase the temperature value and keep top-p wide.
However, in scenarios requiring technical accuracy and stability, the temperature should be kept around 0.0–0.4, and top-k should be set to limited values.
Preventing Repetition
Presence penalty and frequency penalty allow you to control the model’s tendency to repeatedly generate the same words or ideas.
For long responses, applying both penalties at moderate levels is usually sufficient to prevent the model from continuously returning the same expressions.
Especially when producing educational or explanatory outputs, optimizing these settings directly affects text quality.
Controlling Response Length and Format
Max tokens limits the total length of a response. Very short values may cause the answer to be cut off early; very high values may cause the model to drift off-topic. Formatting options such as convert numbers to text should also be adjusted according to the target audience. For humans, numbers should be presented in written form (“fifteen”), while for machine processing, they should be shown as digits (“15”).
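One practical pattern is to keep these generation settings as named presets and select one per task type. The sketch below uses OpenAI-style parameter names, and the values are starting points to experiment with rather than fixed recommendations:

```python
GENERATION_PRESETS = {
    "technical": {          # documentation, support answers, code: consistent and to the point
        "temperature": 0.2,
        "top_p": 0.8,
        "frequency_penalty": 0.0,
        "presence_penalty": 0.0,
        "max_tokens": 400,
    },
    "creative": {           # stories, brainstorming, marketing copy: diverse and exploratory
        "temperature": 1.0,
        "top_p": 0.95,
        "frequency_penalty": 0.3,
        "presence_penalty": 0.6,
        "max_tokens": 800,
    },
}

def settings_for(task_type):
    """Fall back to the conservative preset if the task type is unknown."""
    return GENERATION_PRESETS.get(task_type, GENERATION_PRESETS["technical"])
```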
Search and Reranking Strategy
In retrieval-based systems (e.g., RAG), you should first collect as many documents as possible using searchLimit, then rank these documents by relevance scores using a reranking model.
Afterwards, by combining rerankThreshold and Top N, only the most meaningful content should be filtered and passed to the LLM.
This structure reduces unnecessary text load while enabling the model to produce more accurate and focused responses.
All the parameters mentioned in this article can be directly controlled by users within the SkyStudio platform. These values can be defined separately for each assistant, allowing the system’s behavior to be finely tuned according to the task type and context.
While high accuracy and strict filtering are prioritized in an information-focused assistant, a more flexible and broad generation space may be needed in a scenario focused on creative text production. SkyStudio makes it possible to configure these needs, offering a flexible control area both to users who want to experiment with settings and to professional users who expect reliable system outputs.
Understanding how these parameters affect system behavior and fine-tuning accordingly plays a decisive role in the success of an LLM-based assistant. By simplifying this process, SkyStudio supports you in achieving better results both technically and practically.
Get in Touch to Access Your Free Demo