C String Processing: Implementing Word Counting and Longest-Word Search - 嵌云网 (embedded AI development resource site)

黄泓毅

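1. Counting Words: A Complete Implementation and Its Refinement

Counting the words in a string looks simple in C, but a real implementation has to deal with several edge cases. This section walks through a robust word-counting approach and explains the reasoning behind each design decision.

1.1 Basic Approach

The most intuitive algorithm walks the string and counts a new word whenever it reaches a non-whitespace character that is not already inside a word, that is, at the start of the string or right after whitespace. The core code: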

#include <ctype.h>

int count_words(const char *str) {
    int count = 0;
    int in_word = 0;
    
    while (*str) {
        if (isspace(*str)) {
            in_word = 0;
        } else if (!in_word) {
            count++;
            in_word = 1;
        }
        str++;
    }
    return count;
}

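This basic version handles the common cases correctly:

Runs of consecutive spaces are handled correctly
Leading and trailing spaces do not affect the count
Input made up only of whitespace yields 0, because in_word never becomes 1

It still has two weaknesses: it does not guard against a NULL pointer, and it passes a plain char to isspace(), which is undefined behavior for negative values (for example, bytes above 0x7F on platforms where char is signed).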

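1.2 Handling Input Edge Cases

Reading input with fgets() adds some complications:

fgets() keeps the newline character '\n'
Empty input (just pressing Enter) produces a string containing only '\n'
Whitespace-only input is not caught by an empty-string check, so the counting function itself must return 0 for it

The improved handling logic: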

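// Fragment: assumed to sit inside a function returning void
char buffer[256];
if (fgets(buffer, sizeof(buffer), stdin) == NULL) {
    return; // EOF or read error: buffer contents are not usable
}

// Remove the trailing newline, if there is one
buffer[strcspn(buffer, "\n")] = '\0';

// Check for an empty string before counting
if (buffer[0] == '\0') {
    printf("Word count: 0\n");
    return;
}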
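The fragment above silently splits a line that is longer than the buffer: the remainder stays in the input stream and is returned by the next fgets() call. A minimal sketch of one way to detect this follows. The helper name read_line and its truncated out-parameter are assumptions made for illustration, not part of the original program.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the article): reads one line into buf.
 * Returns 1 on success and 0 on EOF or read error. Sets *truncated to 1
 * when the line did not fit and the excess characters were discarded. */
int read_line(FILE *in, char *buf, size_t size, int *truncated) {
    *truncated = 0;
    if (size == 0 || fgets(buf, (int)size, in) == NULL) {
        return 0;
    }
    size_t len = strcspn(buf, "\n");
    if (buf[len] == '\n') {
        buf[len] = '\0';               /* complete line: drop the newline */
    } else if (len == size - 1) {
        /* Buffer is full and holds no newline: discard the rest of the line */
        int c;
        while ((c = getc(in)) != '\n' && c != EOF) {
            *truncated = 1;
        }
    }
    return 1;
}
```

A caller could print a warning when truncated is set, or reject the line altogether, depending on how strict the program needs to be.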

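1.3 Complete Word-Counting Implementation

Taking the points above into account, the complete word-counting function looks like this: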

int count_words_robust(const char *str) {
    if (str == NULL || *str == '\0') {
        return 0;
    }

    int count = 0;
    int in_word = 0;
    
    while (*str) {
        if (isspace((unsigned char)*str)) {
            in_word = 0;
        } else {
            if (!in_word) {
                count++;
                in_word = 1;
            }
        }
        str++;
    }
    return count;
}

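Note: using isspace() instead of comparing against ' ' directly handles all whitespace characters (tabs, newlines, and so on). Casting the char to unsigned char avoids the undefined behavior that occurs when a negative value is passed to the <ctype.h> functions, which can happen on platforms where char is signed.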
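When the set of separators needs to be configurable (for example, treating commas as separators too), the same count can be obtained with strspn() and strcspn(), both standard <string.h> functions. The sketch below is an alternative to the state-machine loop, not the article's method; the name count_words_delims and the delims parameter are assumptions.

```c
#include <string.h>

/* Hypothetical variant: counts the words in str, where a word is a maximal
 * run of characters that do not appear in delims. */
int count_words_delims(const char *str, const char *delims) {
    if (str == NULL || delims == NULL) {
        return 0;
    }
    int count = 0;
    str += strspn(str, delims);        /* skip leading delimiters */
    while (*str != '\0') {
        count++;
        str += strcspn(str, delims);   /* skip over the word itself */
        str += strspn(str, delims);    /* skip the delimiters after it */
    }
    return count;
}
```

For example, count_words_delims(line, " \t\n,") counts "hello,  world" as two words.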

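2. Implementation Details of Finding the Longest Word

Finding the longest word while counting requires tracking the start position and length of each word.

2.1 Data Structure Design

We need to:

Store the content of the longest word seen so far
Track where each word starts and ends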

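char longest_word[256] = {0};   // copy of the longest word found so far
int max_length = 0;             // length of longest_word
const char *word_start = NULL;  // start of the current word, NULL between words
int current_length = 0;         // length of the current word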

2.2 Word Boundary Detection

While walking the string, the start and end of each word must be identified precisely:

for (const char *p = buffer; *p; p++) {
    if (!isspace((unsigned char)*p)) {
        if (word_start == NULL) {
            word_start = p; // word begins
        }
        current_length++;
    } else {
        if (word_start != NULL) { // word ends
            if (current_length > max_length) {
                max_length = current_length;
                strncpy(longest_word, word_start, max_length);
                longest_word[max_length] = '\0';
            }
            word_start = NULL;
            current_length = 0;
        }
    }
}

// Check the last word (the string may end without trailing whitespace)
if (word_start != NULL && current_length > max_length) {
    max_length = current_length; // keep max_length in sync with longest_word
    strncpy(longest_word, word_start, current_length);
    longest_word[current_length] = '\0';
}
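The loop above copies a word into longest_word every time a new maximum is found. A possible alternative, sketched here under the assumption that the input string stays alive while the result is used, is to return only a pointer and a length and let the caller copy or print as needed. The function name find_longest_word is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns a pointer to the first longest
 * whitespace-delimited word in str and stores its length in *out_len.
 * Returns NULL with *out_len == 0 when str is NULL or contains no words. */
const char *find_longest_word(const char *str, size_t *out_len) {
    const char *best = NULL;
    size_t best_len = 0;
    if (str != NULL) {
        const char *p = str;
        while (*p != '\0') {
            while (isspace((unsigned char)*p)) {
                p++;                           /* skip whitespace */
            }
            const char *start = p;
            while (*p != '\0' && !isspace((unsigned char)*p)) {
                p++;                           /* advance to the word's end */
            }
            size_t len = (size_t)(p - start);
            if (len > best_len) {              /* strict '>' keeps the first of equals */
                best = start;
                best_len = len;
            }
        }
    }
    *out_len = best_len;
    return best;
}
```

Since the returned word is not null-terminated, it can be printed with a precision, for example printf("%.*s\n", (int)len, word).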

2.3 Optimizing with a Pointer Array

The brr[] array approach mentioned in the original text avoids copying strings repeatedly:

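const char *word_ptrs[100]; // assume at most 100 words
int word_count = 0;

// Inside the counting loop: record a pointer at the start of each word
if (!in_word && !isspace((unsigned char)*str)) {
    if (word_count < 100) { // stay within the array bounds
        word_ptrs[word_count++] = str;
    }
    in_word = 1;
}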

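You can then walk the word_ptrs array and compare word lengths to find the longest word.

Important: strlen() returns a word's length only if every word is terminated by '\0'. The pointers here point into the original string, so strlen() would otherwise measure from the word to the end of the whole string. Either terminate each word (by overwriting the whitespace after it in a modifiable buffer) or record each word's length explicitly.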
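One way to satisfy the termination requirement is to split a modifiable buffer in place, overwriting the first whitespace character after each word with '\0'. The following is a minimal sketch under that assumption; split_in_place is a hypothetical name, and the function must not be given a string literal because it writes into its argument.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: splits text in place. Each stored pointer refers to a
 * null-terminated word inside text. Stores at most max_words pointers and
 * returns how many were stored. */
size_t split_in_place(char *text, char *words[], size_t max_words) {
    size_t n = 0;
    char *p = text;
    while (*p != '\0' && n < max_words) {
        while (isspace((unsigned char)*p)) {
            p++;                       /* skip whitespace before the word */
        }
        if (*p == '\0') {
            break;                     /* only trailing whitespace was left */
        }
        words[n++] = p;                /* word starts here */
        while (*p != '\0' && !isspace((unsigned char)*p)) {
            p++;
        }
        if (*p != '\0') {
            *p++ = '\0';               /* terminate the word, step past it */
        }
    }
    return n;
}
```

After this call, strlen(words[i]) is the length of the i-th word, at the cost of destroying the original spacing in the buffer.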

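3. Counting Occurrences of a Specific Word

With the word pointer array built earlier, counting the occurrences of a specific word (such as "the") is straightforward.

3.1 Basic Comparison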

int count_the = 0;
for (int i = 0; i < word_count; i++) {
    // Match the three letters, then require a word boundary right after them
    if (strncmp(word_ptrs[i], "the", 3) == 0 &&
        (isspace((unsigned char)word_ptrs[i][3]) || word_ptrs[i][3] == '\0')) {
        count_the++;
    }
}

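Points to note:

Use strncmp rather than strcmp: the words are not null-terminated inside the buffer, so strcmp would keep comparing past the end of the word
Check the character right after the compared prefix to make sure the match is exact (so that "there" does not match)
Consider letter case (the text may need to be lowercased before comparing)

3.2 A More Robust Comparison Function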

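int compare_word(const char *text, const char *word) {
    // Advance while the characters match, ignoring case
    while (*word && tolower((unsigned char)*text) == tolower((unsigned char)*word)) {
        text++;
        word++;
    }
    // Match only if all of word was consumed and text is at a word boundary
    return *word == '\0' && (isspace((unsigned char)*text) || *text == '\0');
}

// Usage example
if (compare_word(word_ptrs[i], "the")) {
    count_the++;
}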
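If the pointer array is not needed for anything else, the same count can be produced in a single pass over the text. The sketch below compares lengths first and characters second, so "there" is rejected without a boundary check. The name count_word_ci is an assumption, and the comparison only folds case for characters that tolower() handles in the current locale.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: counts the whitespace-delimited words in text that
 * equal target, ignoring case. */
int count_word_ci(const char *text, const char *target) {
    if (text == NULL || target == NULL || *target == '\0') {
        return 0;
    }
    size_t target_len = strlen(target);
    int count = 0;
    const char *p = text;
    while (*p != '\0') {
        while (isspace((unsigned char)*p)) {
            p++;
        }
        const char *start = p;
        while (*p != '\0' && !isspace((unsigned char)*p)) {
            p++;
        }
        if ((size_t)(p - start) == target_len) {   /* same length: compare */
            size_t i = 0;
            while (i < target_len &&
                   tolower((unsigned char)start[i]) ==
                   tolower((unsigned char)target[i])) {
                i++;
            }
            if (i == target_len) {
                count++;
            }
        }
    }
    return count;
}
```

Note that a word with attached punctuation, such as "the," or "the.", does not match under this definition of a word.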

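4. Notes on Printing the String

Word processing may modify the original string (by writing '\0' into it), in which case printing it directly afterwards shows only the text up to the first inserted terminator. Solutions:

4.1 Print from a Copy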

void print_original(const char *str) {
    char *copy = strdup(str); // strdup() is POSIX (standard C only as of C23)
    if (copy == NULL) {
        perror("strdup failed");
        return;
    }
    
    // Process copy and leave str unchanged
    process_words(copy);
    
    printf("Original: %s\n", str);
    free(copy);
}
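strdup() comes from POSIX and was added to the C standard only in C23, so a strictly conforming C99 or C11 toolchain may not declare it. A small replacement built on malloc() and memcpy() might look like the following; dup_string is a hypothetical name chosen to avoid clashing with the library function.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical portable stand-in for strdup(): returns a heap copy of s that
 * the caller must free(), or NULL if s is NULL or allocation fails. */
char *dup_string(const char *s) {
    if (s == NULL) {
        return NULL;
    }
    size_t size = strlen(s) + 1;       /* include the terminating '\0' */
    char *copy = malloc(size);
    if (copy != NULL) {
        memcpy(copy, s, size);
    }
    return copy;
}
```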

4.2 A Processing Function Without Side Effects

A better design is to write functions that do not modify the input string:

void process_words(const char *input) {
    char buffer[256];
    // strncpy does not terminate on truncation, so terminate explicitly
    strncpy(buffer, input, sizeof(buffer)-1);
    buffer[sizeof(buffer)-1] = '\0';
    
    // Process buffer
}

5. Complete Program Example

A complete implementation that combines all of the features above:

#include <stdio.h>
#include <string.h>
#include <strings.h> // strncasecmp() (POSIX)
#include <ctype.h>
#include <stdlib.h>

#define MAX_WORDS 100
#define MAX_LENGTH 256

struct WordInfo {
    const char *start;
    int length;
};

int process_input(const char *input, struct WordInfo words[], char longest[]) {
    if (input == NULL || *input == '\0') {
        return 0;
    }

    int count = 0;
    int in_word = 0;
    const char *start = NULL;
    int max_len = 0;
    
    for (const char *p = input; *p; p++) {
        if (isspace((unsigned char)*p)) {
            if (in_word) { // word ends
                int len = (int)(p - start);
                words[count].start = start;
                words[count].length = len;
                
                if (len > max_len) {
                    max_len = len;
                    strncpy(longest, start, len);
                    longest[len] = '\0';
                }
                
                count++;
                if (count >= MAX_WORDS) break;
                in_word = 0;
            }
        } else {
            if (!in_word) { // word begins
                start = p;
                in_word = 1;
            }
        }
    }
    
    // Handle the last word (input that does not end in whitespace)
    if (in_word && count < MAX_WORDS) {
        int len = (int)strlen(start);
        words[count].start = start;
        words[count].length = len;
        
        if (len > max_len) {
            strcpy(longest, start);
        }
        count++;
    }
    
    return count;
}

int main(void) {
    char buffer[MAX_LENGTH];
    printf("Enter a sentence: ");
    if (fgets(buffer, sizeof(buffer), stdin) == NULL) {
        perror("Input error");
        return 1;
    }
    
    // Remove the newline
    buffer[strcspn(buffer, "\n")] = '\0';
    
    struct WordInfo words[MAX_WORDS];
    char longest[MAX_LENGTH] = {0};
    
    int word_count = process_input(buffer, words, longest);
    
    printf("Total words: %d\n", word_count);
    if (word_count > 0) {
        printf("Longest word: %s\n", longest);
        
        // Count occurrences of "the"
        int the_count = 0;
        for (int i = 0; i < word_count; i++) {
            if (words[i].length == 3 && 
                strncasecmp(words[i].start, "the", 3) == 0) {
                the_count++;
            }
        }
        printf("'the' count: %d\n", the_count);
    }
    
    printf("Original input: %s\n", buffer);
    
    return 0;
}

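6. Common Problems and Debugging Tips

6.1 Troubleshooting Typical Problems

Wrong count:
- Check that leading and trailing spaces are handled
- Verify the case of several consecutive spaces
- Test empty input and whitespace-only input
Wrong longest word:
- Make sure the word start pointer is tracked correctly
- Check that strncpy is given the right length
- Verify that the last word is taken into account
Memory problems:
- Make sure every string operation stays within its buffer
- Check whether any pointer can be NULL
- Verify that the terminator '\0' is set correctly

6.2 Debugging Suggestions

Add debug prints: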

printf("Current char: '%c' (0x%02x), in_word: %d, count: %d\n",
       *p, (unsigned)(unsigned char)*p, in_word, count);

Use test cases:

const char *test_cases[] = {
    "", " ", "  ", "hello", "hello world", "  leading", "trailing  ", 
    "multiple   spaces", "end.\n", NULL
};
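The test inputs become more useful when each one is paired with its expected count, so a regression shows up as a failed comparison rather than as output that has to be read by eye. The sketch below is one way to do that. The struct CountCase, the runner run_count_cases, and the local reference counter count_words_ref are all hypothetical names, and the expected values assume the whitespace-delimited word definition used throughout the article.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

struct CountCase {
    const char *input;
    int expected;
};

/* Reference counter with the same logic as the article's robust version. */
static int count_words_ref(const char *str) {
    int count = 0;
    int in_word = 0;
    for (; *str != '\0'; str++) {
        if (isspace((unsigned char)*str)) {
            in_word = 0;
        } else if (!in_word) {
            count++;
            in_word = 1;
        }
    }
    return count;
}

/* Runs every case, reports mismatches on stderr, returns the failure count. */
int run_count_cases(void) {
    static const struct CountCase cases[] = {
        {"", 0}, {" ", 0}, {"  ", 0}, {"hello", 1}, {"hello world", 2},
        {"  leading", 1}, {"trailing  ", 1}, {"multiple   spaces", 2},
        {"end.\n", 1},
    };
    int failures = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        int got = count_words_ref(cases[i].input);
        if (got != cases[i].expected) {
            fprintf(stderr, "case %zu: expected %d, got %d\n",
                    i, cases[i].expected, got);
            failures++;
        }
    }
    return failures;
}
```

To test a different implementation, replace the call to count_words_ref with the function under test.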

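Boundary checks:

Over-long input (longer than the buffer)
Input containing various whitespace characters (\t, \n, \v, and so on)
Tests with non-English characters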

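7. Performance Optimization and Extensions

7.1 Performance Considerations

Avoid unnecessary copies:
- Mark words with a pointer and a length instead of copying substrings
- Create a copy only when the text has to be modified
Single pass:
- Gather all statistics in one traversal
- Reduce the number of scans over the string
Memory allocation:
- For large texts, consider allocating the word array dynamically
- Avoid the limits imposed by fixed-size buffers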
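The dynamic-allocation point can be sketched as follows: the word array starts small and doubles through realloc() whenever it fills up, so no MAX_WORDS limit is needed. The names WordSpan and collect_words and the initial capacity of 16 are assumptions for illustration; the spans point into the input text, which therefore has to outlive the array.

```c
#include <ctype.h>
#include <stdlib.h>

struct WordSpan {
    const char *start;   /* first character of the word inside the input */
    size_t length;       /* number of characters in the word */
};

/* Hypothetical helper: collects every whitespace-delimited word of text into
 * a heap array stored in *out (caller frees it). Returns the word count, or
 * (size_t)-1 with *out == NULL if allocation fails. */
size_t collect_words(const char *text, struct WordSpan **out) {
    struct WordSpan *words = NULL;
    size_t count = 0;
    size_t capacity = 0;
    const char *p = text;
    while (*p != '\0') {
        while (isspace((unsigned char)*p)) {
            p++;
        }
        if (*p == '\0') {
            break;
        }
        const char *start = p;
        while (*p != '\0' && !isspace((unsigned char)*p)) {
            p++;
        }
        if (count == capacity) {                 /* array full: double it */
            size_t new_cap = capacity ? capacity * 2 : 16;
            struct WordSpan *grown = realloc(words, new_cap * sizeof *grown);
            if (grown == NULL) {
                free(words);
                *out = NULL;
                return (size_t)-1;
            }
            words = grown;
            capacity = new_cap;
        }
        words[count].start = start;
        words[count].length = (size_t)(p - start);
        count++;
    }
    *out = words;   /* stays NULL when the text contains no words */
    return count;
}
```

Doubling keeps the number of reallocations logarithmic in the word count, which is the usual reason for preferring it over growing by a fixed step.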

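7.2 Ideas for Extending the Program

Ignore list:
- Skip common articles and prepositions (a, an, the, of, and so on)
Word frequency:
- Use a hash table to record how often each word occurs
Punctuation handling:
- Remove punctuation before and after words
- Handle contractions and possessives ('s)
Multilingual support:
- Consider word boundaries for Unicode characters
- Handle languages that do not separate words with spaces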
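As an illustration of the punctuation item, a word stored as a pointer and a length can be narrowed so that surrounding punctuation is excluded, without copying anything. This is only a sketch under the assumption of ASCII punctuation as classified by ispunct(); the name trim_punct is hypothetical, and punctuation inside a word (the apostrophe in "don't") is deliberately left alone.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: shrinks the span [*start, *start + *len) so that it no
 * longer begins or ends with punctuation. A span made up entirely of
 * punctuation ends with *len == 0. */
void trim_punct(const char **start, size_t *len) {
    const char *s = *start;
    size_t n = *len;
    while (n > 0 && ispunct((unsigned char)s[0])) {
        s++;                         /* drop leading punctuation */
        n--;
    }
    while (n > 0 && ispunct((unsigned char)s[n - 1])) {
        n--;                         /* drop trailing punctuation */
    }
    *start = s;
    *len = n;
}
```

Applied before the comparison in the "the" count, this would let a word followed by a comma or a period match as well.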

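This word-counting program is basic, but it covers core C concepts such as string processing, pointer manipulation, and edge-case handling. Refined step by step, it can grow into a robust and efficient text-processing tool. Similar string-processing logic underlies many real applications, so mastering these techniques pays off directly in everyday coding.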